Abstract

Background: Network concepts are increasingly used in biology and genetics. For example, the 
clustering coefficient has been used to understand network architecture; the connectivity (also 
known as degree) has been used to screen for cancer targets; and the topological overlap matrix 
has been used to define modules and to annotate genes. Dozens of potentially useful network 
concepts are known from graph theory. 

Results: Here we study network concepts in special types of networks, which we refer to as 
approximately factorizable networks. In these networks, the pairwise connection strength 
(adjacency) between 2 network nodes can be factored into node specific contributions, named 
node 'conformity'. The node conformity turns out to be highly related to the connectivity. To 
provide a formalism for relating network concepts to each other, we define three types of network 
concepts: fundamental-, conformity-based-, and approximate conformity-based concepts. 
Fundamental concepts include the standard definitions of connectivity, density, centralization, 
heterogeneity, clustering coefficient, and topological overlap. The approximate conformity-based 
analogs of fundamental network concepts have several theoretical advantages. First, they allow one 
to derive simple relationships between seemingly disparate network concepts. For example, we
derive simple relationships between the clustering coefficient, the heterogeneity, the density, the 
centralization, and the topological overlap. The second advantage of approximate conformity-based 
network concepts is that they allow one to show that fundamental network concepts can be 
approximated by simple functions of the connectivity in module networks. 

Conclusion: Using protein-protein interaction, gene co-expression, and simulated data, we show 
that a) many networks comprised of module nodes are approximately factorizable and b) in these 
types of networks, simple relationships exist between seemingly disparate network concepts. Our 
results are implemented in freely available R software code, which can be downloaded from the 
following webpage: http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/ModuleNetworks



Background 

Network terminology is used to study important ques- 
tions in systems biology. For example, networks are used 
to study functional enrichment [1], to analyze the structure
of cellular networks [2], to model biological signalling
or regulatory networks [1,3], to reconstruct metabolic
networks [4], and to study the dynamic behavior of gene 
regulatory networks [5]. 

Here we study the meaning of network concepts in relatively
simple networks, e.g. gene co-expression networks
and protein-protein interaction (PPI) networks. Specifically,
we consider undirected networks that can be represented
by a symmetric adjacency matrix $A = [a_{ij}]$, where
the pairwise adjacency (connection strength) $a_{ij}$ takes
values in the unit interval, i.e., $0 \le a_{ij} \le 1$. For an
unweighted network, the adjacency $a_{ij} = 1$ if nodes $i$ and $j$
are connected and 0 otherwise. For a weighted network,
$0 \le a_{ij} \le 1$. For notational convenience, we set the
diagonal elements to 1.

Fundamental network concepts 

Other authors refer to network concepts as network statis- 
tics or network indices. Network concepts include connec- 
tivity, mean connectivity, density, variance of the 
connectivity (related to the heterogeneity) etc. Network 
concepts can be used as descriptive statistics for networks. 
While some network concepts (e.g. connectivity) have 
found important uses in biology and genetics, other net- 
work concepts (e.g. network centralization) appear less 
interesting to biologists. Before attempting to understand 
why some concepts are more interesting than others, it is 
important to understand how network concepts relate to 
each other in biologically interesting networks. As a step 
toward this goal, we explore the meaning of network con- 
cepts in module networks, which are defined below. 

In the following, we review fundamental network con- 
cepts. Further details on the definitions and notations can 
be found in the Methods section. 

The node connectivity is given by

$$\mathit{Connectivity}_i = k_i = \sum_{j \neq i} a_{ij}. \tag{1}$$

In unweighted networks, the connectivity of node $i$
equals the number of directly linked neighbors. In
weighted networks, the connectivity equals the sum of
connection weights with all other nodes. Highly connected
'hub' genes are thought to play an important role
in organizing the behavior of biological networks [6-9].
Connectivity has been found to be an important complementary
gene screening variable for finding biologically
significant genes in cancer [10,11] and primate brain
development [12].

The line density [13] is defined as the mean off-diagonal
adjacency and is closely related to the mean connectivity.

$$\mathit{Density} = \frac{\sum_i \sum_{j \neq i} a_{ij}}{n(n-1)} = \frac{S_1(k)}{n(n-1)} = \frac{\mathit{mean}(k)}{n-1}, \tag{2}$$

where the function $S_p(\cdot)$ is defined for a vector $v$ as
$S_p(v) = \sum_i v_i^p$.

The normalized connectivity centralization (also known
as degree centralization) is a simple and widely used index
of the connectivity distribution. By definition [14], the
normalized connectivity centralization is given by

$$\mathit{Centralization} = \frac{n}{n-2}\left(\frac{\max(k)}{n-1} - \mathit{Density}\right) \approx \frac{\max(k)}{n} - \mathit{Density}. \tag{3}$$



A frequent question of social network analysis concerns 
the causes and consequences of centralization in network 
structure, i.e. the extent to which certain nodes are far 
more central than others within the network in question. 
The centralization index has been used to describe struc- 
tural differences of metabolic networks [15]. 

Many measures of network heterogeneity are based on the 
variance of the connectivity, and authors differ on how to 
scale the variance [13]. Our definition of the network het- 
erogeneity equals the coefficient of variation of the con- 
nectivity distribution, i.e. 



^variance{k) _ InSjiJt) ^ 



Heterogeneity - — ^ = / "^^^"'^ _ i (4) 

meanik) \S^{kf 

This heterogeneity measure is scale invariant with respect 
to multiplying the connectivity by a scalar. Biological net- 
works tend to be very heterogeneous: while some 'hub' 
nodes are highly connected, the majority of nodes tend to 
have very few connections. Describing the heterogeneity 
(inhomogeneity) of the connectivity (degree) distribution 
has been the focus of considerable research in recent years 
[6,16-18]. 
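
As a minimal sketch of how the fundamental concepts in equations (1) to (4) can be computed, the following Python code evaluates them on a small adjacency matrix stored as a list of lists. The function names and the toy star network are illustrative assumptions and are not taken from the R software mentioned in the abstract.

```python
import math

def connectivity(adj):
    """Return k_i = sum of a_ij over j != i (equation (1)); the diagonal is ignored."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

def density(adj):
    """Mean off-diagonal adjacency (equation (2))."""
    n = len(adj)
    return sum(connectivity(adj)) / (n * (n - 1))

def centralization(adj):
    """Normalized connectivity centralization, exact form of equation (3); needs n > 2."""
    n = len(adj)
    return n / (n - 2) * (max(connectivity(adj)) / (n - 1) - density(adj))

def heterogeneity(adj):
    """Coefficient of variation of the connectivity (equation (4))."""
    k = connectivity(adj)
    n = len(k)
    s1 = sum(k)
    s2 = sum(x * x for x in k)
    # max(0, ...) guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, n * s2 / s1 ** 2 - 1.0))

# Hypothetical toy network: a star with hub node 0 and three leaves.
star = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
print(connectivity(star), density(star), centralization(star), heterogeneity(star))
```

In the star network the hub is connected to every other node, so the centralization reaches its maximum, while the density stays moderate because the leaves are not connected to each other.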

The clustering coefficient of node $i$ is a density measure
of local connections, or 'cliquishness' [19,20]. Specifically,

$$\mathit{ClusterCoef}_i = \frac{n_i}{\pi_i} = \frac{\sum_{j \neq i} \sum_{u \neq i,j} a_{ij} a_{ju} a_{ui}}{\left(\sum_{j \neq i} a_{ij}\right)^2 - \sum_{j \neq i} a_{ij}^2}. \tag{5}$$

In unweighted networks, the numerator $n_i$ equals twice the number of
direct connections among the nodes connected to node $i$,
and the denominator $\pi_i$ equals twice the maximum possible number of
direct connections among the nodes connected to node $i$.
Consequently, $\mathit{ClusterCoef}_i$ equals 1 if and only if all
neighbors of $i$ are also connected to each other. For general
weighted networks with $0 \le a_{ij} \le 1$, one can prove
$0 \le \mathit{ClusterCoef}_i \le 1$ [21]. The relationship between the
clustering coefficient and modular structure has been
investigated by several authors [20,22-24].

Figure 1

Hierarchical clustering dendrogram and module definition. Panel titles: A) Fly Protein-Protein Network; B) Yeast
Protein-Protein Network; C) Yeast Co-expression Network: Soft Thresholding (colored by module membership); D) Yeast
Co-expression Network: Hard Thresholding (colored by module membership of soft thresholding). A) Drosophila PPI network.
The dendrogram results from average linkage hierarchical clustering. The color-band below the dendrogram denotes the
modules, which are defined as branches in the dendrogram. Of the 1371 proteins, 862 were clustered into 28 proper modules,
and the remaining proteins are colored in grey; B) yeast PPI network; C) weighted gene co-expression network (yeast);
D) unweighted gene co-expression network (yeast). To facilitate a comparison between the weighted and the unweighted gene
co-expression networks, we used the module assignment of C) in D). Note that the colors of C) tend to stay together in D),
which illustrates high module preservation.
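
The clustering coefficient can be sketched in a few lines of Python. The code below assumes the weighted form in which the numerator sums products of adjacencies over ordered pairs of neighbors and the denominator is the squared connectivity minus the sum of squared adjacencies. The function name, the toy network, and the convention of returning 0 when the denominator vanishes are illustrative assumptions.

```python
def cluster_coef(adj, i):
    """Clustering coefficient of node i for a symmetric adjacency matrix with entries in [0, 1]."""
    n = len(adj)
    others = [j for j in range(n) if j != i]
    # Numerator: sum over ordered pairs (j, u) of distinct neighbors of i.
    num = sum(adj[i][j] * adj[j][u] * adj[u][i]
              for j in others for u in others if u != j)
    # Denominator: (sum_j a_ij)^2 - sum_j a_ij^2.
    den = sum(adj[i][j] for j in others) ** 2 - sum(adj[i][j] ** 2 for j in others)
    # Assumed convention: nodes with fewer than two neighbors get coefficient 0.
    return num / den if den > 0 else 0.0

# Hypothetical toy network: triangle 0-1-2 plus a pendant node 3 attached to node 0.
graph = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
print([cluster_coef(graph, i) for i in range(4)])
```

Node 1 has two neighbors that are connected to each other, so its coefficient is 1, whereas only one of the three possible connections among the neighbors of node 0 is present.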

The topological overlap between nodes $i$ and $j$ reflects
their relative interconnectedness [20,25]. It is defined by

$$\mathit{TopOverlap}_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}, \tag{6}$$

where $l_{ij} = \sum_{u \neq i,j} a_{iu} a_{uj}$. In an unweighted network, $l_{ij}$ equals
the number of nodes to which both $i$ and $j$ are connected.
In this case, $\mathit{TopOverlap}_{ij} = 1$ if the node with fewer connections
satisfies two conditions: (a) all of its neighbors are
also neighbors of the other node, and (b) it is connected
to the other node. In contrast, $\mathit{TopOverlap}_{ij} = 0$ if $i$ and $j$ are
un-connected and the two nodes do not share any neighbors.
By convention, $\mathit{TopOverlap}_{ii} = 1$. One can prove that
$0 \le a_{ij} \le 1$ implies $0 \le \mathit{TopOverlap}_{ij} \le 1$ [21].
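
The following Python sketch computes the topological overlap of a node pair, assuming the form (shared-neighbor term plus adjacency) divided by (smaller connectivity plus 1 minus adjacency). The function name and the toy network are illustrative assumptions.

```python
def top_overlap(adj, i, j):
    """Topological overlap between nodes i and j of a symmetric adjacency matrix."""
    if i == j:
        return 1.0  # by convention
    n = len(adj)
    k_i = sum(adj[i][u] for u in range(n) if u != i)
    k_j = sum(adj[j][u] for u in range(n) if u != j)
    # l_ij: adjacency-weighted count of neighbors shared by i and j.
    shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
    return (shared + adj[i][j]) / (min(k_i, k_j) + 1 - adj[i][j])

# Hypothetical toy network: triangle 0-1-2 plus a pendant node 3 attached to node 0.
graph = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
print(top_overlap(graph, 1, 2), top_overlap(graph, 3, 1))
```

Nodes 1 and 2 are connected and share their only other neighbor, so their overlap is 1. Nodes 3 and 1 are not connected but share node 0, which gives an intermediate value.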

The Topological Overlap Matrix Can Be Considered as Adjacency 
Matrix 

Since the matrix TopOverlap = [TopOverlapij] is symmetric 
and its entries lie in [0, 1], it satisfies our assumptions on 
an adjacency matrix. Roughly speaking, the topological 
overlap matrix can be considered as a 'smoothed out' ver- 
sion of the adjacency matrix. The elements of TopOverlap 
provide an alternative measure of connection strength 
based on shared neighbors. There is evidence that replac- 
ing A by TopOverlap may counter the adverse effects of 
spurious or missing adjacencies [25,26]. Since the adja- 
cency matrices of the PPI networks in our applications 
were very sparse, we replaced them by the corresponding 
topological overlap matrices. In contrast, we used the 
original adjacency matrix when analyzing gene co-expres- 
sion networks since high specificity is desirable for meas- 
uring interconnectedness in co-expression networks. 

The topological overlap matrix can be used for module
definition

Our main interest lies in (sub-)networks comprised of 
nodes that form a module inside a larger network. Since a 
particular module network may encode a pathway or a 
protein complex, these special types of networks have 
great practical importance. Similar to the term 'cluster', no 
consensus on the meaning of the term 'module' seems to 
exist in the literature. In our applications, we use a cluster- 
ing procedure to identify modules (clusters) of nodes with 
high topological overlap. We follow the suggestion of [20] 
to turn the topological overlap matrix TopOverlap into a 
dissimilarity measure by subtracting it from 1, i.e.
$\mathit{dissTopOverlap}_{ij} = 1 - \mathit{TopOverlap}_{ij}$.

We use $\mathit{dissTopOverlap}_{ij}$ as input of average linkage hierarchical
clustering to arrive at a dendrogram (clustering tree) [27].
Modules are defined as the branches of the dendrogram. For example, in
Figure 1 we show the dendrograms of our network applications. Genes or proteins of
proper modules are assigned a color (e.g. turquoise, blue 
etc). Genes outside any proper module are colored grey. 
Our module definition depends on how the branches are 
cut off the dendrogram. Several methods and criteria for 
identifying branches in a dendrogram have been pro- 
posed, see e.g. [20,21,28]. In practice, it is advisable to 
study how robust the results are with respect to alternative 
module detection methods. In our online R software tuto- 
rial, we show that our findings are highly robust with 
respect to alternative module definitions. In addition, we 
use a functional enrichment analysis of the resulting mod- 
ules to provide indirect evidence that the modules are bio- 
logically meaningful. Our module detection approach has 
led to biologically meaningful modules in several applica- 
tions [9,10,12,20,28-30] but we make no claim that it is 
optimal. Our theoretical results will apply to all module 
detection methods that result in approximately factoriza- 
ble networks. 
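
To make the module definition step more concrete, here is a minimal Python sketch of average linkage hierarchical clustering on a dissimilarity matrix. It cuts the tree at a fixed height, which is only one of several possible branch-cutting rules and is an assumption of this sketch, not necessarily the criterion used in the applications. The function name and the toy dissimilarities are hypothetical.

```python
def average_linkage_modules(diss, height):
    """Merge clusters by average linkage until the closest pair is farther apart than `height`."""
    clusters = [[i] for i in range(len(diss))]

    def dist(a, b):
        # Average linkage: mean pairwise dissimilarity between two clusters.
        return sum(diss[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        candidates = [(dist(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters))]
        d, p, q = min(candidates)
        if d > height:
            break  # equivalent to cutting the dendrogram at this height
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Hypothetical dissimilarities (1 - topological overlap) with two tight groups.
diss = [
    [0.0, 0.1, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.9, 0.2, 0.0],
]
print(average_linkage_modules(diss, height=0.5))
```

A fixed-height cut is easy to reason about but can miss nested modules, which is one reason to check robustness against alternative module detection methods.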

Results 

Conformity and factorizable networks 

We define an adjacency matrix $A$ to be exactly factorizable
if, and only if, there exists a vector $CF$ with non-negative
elements such that

$$a_{ij} = CF_i CF_j \quad \text{for all } i \neq j. \tag{7}$$

If the non-negative solution of equation (7) is unique, it
is referred to as the conformity vector $CF$, and $CF_i$ is the
conformity of node $i$. One can easily show that the vector $CF$
is not unique if the network contains only $n = 2$ nodes.
However, for $n > 2$ it is unique for a weighted network, see
our derivations surrounding equation (20).
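
The uniqueness claim for $n > 2$ can be illustrated with a small Python sketch. It relies on a direct consequence of equation (7): for three distinct nodes $i$, $j$, $u$ with positive adjacencies, $a_{ij} a_{iu} / a_{ju} = CF_i^2$. This shortcut is an assumption of the sketch and is not claimed to be the derivation referred to around equation (20). The function names and example values are hypothetical.

```python
import math

def factorizable_adjacency(cf):
    """Build an exactly factorizable adjacency matrix with unit diagonal from a conformity vector."""
    n = len(cf)
    return [[1.0 if i == j else cf[i] * cf[j] for j in range(n)] for i in range(n)]

def recover_conformity(adj):
    """Recover CF from an exactly factorizable matrix with n >= 3 and positive off-diagonal entries."""
    n = len(adj)
    cf = []
    for i in range(n):
        j, u = [v for v in range(n) if v != i][:2]
        # a_ij * a_iu / a_ju = CF_i^2 when a_ij = CF_i * CF_j for all i != j.
        cf.append(math.sqrt(adj[i][j] * adj[i][u] / adj[j][u]))
    return cf

true_cf = [0.9, 0.5, 0.2, 0.7]
adj = factorizable_adjacency(true_cf)
print(recover_conformity(adj))
```

With only two nodes the single off-diagonal entry cannot pin down both conformities: for example, an adjacency of 0.36 is reproduced by the pair (0.6, 0.6) and by the pair (0.9, 0.4).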

We also define the concept of conformity for a general,
non-factorizable network. The idea is to find an exactly
factorizable adjacency matrix $A_{CF} = CF\,CF^T - \mathit{diag}(CF^2) + I$
that best approximates $A$. Note that the diagonal elements
of $A_{CF}$ and $A$ equal 1.

In the appendix, we define the conformity as a maximizer
of the factorizability function

$$F_A(v) = 1 - \frac{\sum_i \sum_{j \neq i} (a_{ij} - v_i v_j)^2}{\sum_i \sum_{j \neq i} a_{ij}^2}.$$

Alternative methods of decomposing an adjacency matrix are briefly
discussed below.

In equation (43), we define a measure of network factorizability
as follows

$$F(A) = F_A(CF) = 1 - \frac{\sum_i \sum_{j \neq i} (a_{ij} - CF_i CF_j)^2}{\sum_i \sum_{j \neq i} a_{ij}^2}.$$

The factorizability $F(A)$ is normalized to take on values in
the unit interval [0, 1]. The higher $F(A)$, the better $A_{CF} - I$
approximates $A - I$.
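
The sketch below evaluates a factorizability score for a given candidate vector. It assumes that the factorizability function has the form one minus the ratio of the off-diagonal squared residuals to the off-diagonal squared adjacencies; this form, the function names, and the example values are assumptions of the sketch. The maximization over candidate vectors that defines the conformity is not implemented here.

```python
def factorizable_adjacency(cf):
    """Exactly factorizable adjacency matrix with unit diagonal."""
    n = len(cf)
    return [[1.0 if i == j else cf[i] * cf[j] for j in range(n)] for i in range(n)]

def factorizability_at(adj, v):
    """Assumed form: 1 - sum_{i != j} (a_ij - v_i v_j)^2 / sum_{i != j} a_ij^2."""
    n = len(adj)
    resid = sum((adj[i][j] - v[i] * v[j]) ** 2
                for i in range(n) for j in range(n) if i != j)
    total = sum(adj[i][j] ** 2
                for i in range(n) for j in range(n) if i != j)
    return 1.0 - resid / total

cf = [0.9, 0.5, 0.2]
exact = factorizable_adjacency(cf)

# Perturb one symmetric pair of entries so the matrix is no longer exactly factorizable.
perturbed = [row[:] for row in exact]
perturbed[0][1] = perturbed[1][0] = 0.40

print(factorizability_at(exact, cf), factorizability_at(perturbed, cf))
```

Because the score is evaluated at a fixed vector rather than at the maximizer, the value for the perturbed matrix is only a lower bound on its factorizability under the assumed definition.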

Modules can be approximately factorizable 

Approximate factorizability is a very strong structural 
assumption on an adjacency matrix. It certainly does not 
hold for general networks. However, we provide empirical 
evidence that many clusters (modules) of genes or pro- 
teins in real networks are approximately factorizable. 
Table 1 reports the mean values of F(A) for the applica- 
tions considered in this paper. For example in the Dro- 
sophila PPI network, the mean factorizability F(A) is 0.82 
across 'proper' modules defined as clusters in the network. 
In contrast, the factorizability of the subnetwork com- 
prised of non-module nodes is only 0.17. In the yeast PPI 
network, the mean factorizability of proper modules is 
0.85 while it equals only 0.20 for the grey module. In the 
weighted yeast gene co-expression network, the mean fac- 
torizability of proper modules equals 0.73 while it is only 
0.18 for the improper module. Similarly in the 
unweighted yeast gene co-expression network, the mean 
factorizability of proper modules equals 0.62 while it is 
only 0.11 for the improper module. A more detailed table
presenting network concepts in each module is also pro- 
vided [see Additional file 1]. 

Our empirical results support the following 

Observation 1 For many modules defined with a clustering 
procedure, the subnetwork comprised of the module nodes is 
approximately factorizable. 



This observation motivates us to study network concepts 
in approximately factorizable networks. 

Conformity-based network concepts 

We refer to the standard network concepts known from
the literature as fundamental network concepts. In general,
fundamental network concepts are functions of the
off-diagonal elements of the adjacency matrix $A$. More precisely,
we use network concept functions to define different
types of network concepts depending on the input matrix
(see Table 2 and equation (21)). For example, when
inputting an adjacency matrix with its diagonal elements
replaced by 0, one arrives at fundamental network concepts
(see Definition 5 in the Methods section). When
inputting the conformity-based (CF-based) adjacency
matrix $A_{CF}$ with its diagonal elements replaced by 0, one
arrives at CF-based network concepts (see Definition 6 in
the Methods section). The conformity vector can be used
to define the approximate CF-based matrix

$$A_{CF,app} = CF\,CF^T = [CF_i CF_j].$$

Note that the $i$-th diagonal element of $A_{CF,app}$ equals $CF_i^2$.
When $A_{CF,app}$ is used as input of a network concept function,
one arrives at an approximate CF-based concept (see
Definition 7 in the Methods section).

We will demonstrate that approximate CF-based concepts 
satisfy simple relationships. Below, we show that these 
simple relationships carry over to fundamental network 
concepts in approximately factorizable networks. 

In Definition 7, we provide a formula for calculating 
approximate CF-based analogs of the fundamental net- 
work concepts. Specifically, we find 



Table 1: Summary of fundamental network concepts in real network applications.

                     Fly Protein          Yeast Protein        Yeast (Weighted)     Yeast (Unweighted)
Concept              Proper       Grey    Proper       Grey    Proper       Grey    Proper       Grey
Factorizability      .82 (.086)   .170    .85 (.100)   .200    .73 (.084)   .180    .62 (.130)   .110
Density              .21 (.074)   .017    .28 (.120)   .026    .08 (.056)   .005    .40 (.150)   .024
Centralization       .18 (.091)   .052    .20 (.055)   .036    .10 (.026)   .021    .41 (.110)   .140
Heterogeneity        .35 (.130)   .460    .36 (.140)   .430    .56 (.066)   .580    .51 (.097)   .830
Mean Cluster Coef.   .28 (.110)   .050    .36 (.120)   .093    .13 (.072)   .032    .72 (.087)   .370
Mean Conformity      .45 (.076)   .130    .51 (.120)   .150    .26 (.084)   .062    .63 (.100)   .120

Each network contained several proper modules. Non-module genes were grouped into a single (improper) grey module. For each concept, we
report the mean and standard error across the proper modules. A more detailed table presenting network concepts in each module is also
provided [see Additional file 1].
$$
\begin{aligned}
k_{CF,app,i} &= CF_i\, S_1(CF), \\
\mathit{Density}_{CF,app} &= \frac{S_1(CF)^2}{n(n-1)}, \\
\mathit{Centralization}_{CF,app} &= \frac{n\, S_1(CF)}{(n-1)(n-2)} \left( \max(CF) - \frac{S_1(CF)}{n} \right), \\
\mathit{Heterogeneity}_{CF,app} &= \sqrt{\frac{n\, S_2(CF)}{S_1(CF)^2} - 1}, \\
\mathit{ClusterCoef}_{CF,app,i} &= \left( \frac{S_2(CF)}{S_1(CF)} \right)^2, \\
\mathit{TopOverlap}_{CF,app,ij} &= \frac{CF_i CF_j \left( S_2(CF) + 1 \right)}{\min(CF_i, CF_j)\, S_1(CF) + 1 - CF_i CF_j},
\end{aligned}
\tag{8}
$$

where $S_p(CF) = \sum_i (CF_i)^p$. Note that the approximate CF-based
clustering coefficient does not depend on the $i$-index.
This is why we sometimes omit this index and simply
write $\mathit{ClusterCoef}_{CF,app}$.

Approximate CF-based network concepts satisfy simple
relationships

Here we demonstrate a major advantage of approximate
CF-based network concepts: they exhibit simple relationships.
Using the fact that $S_1(k_{CF,app}) = S_1(CF)^2$, and the
approximation $n/(n-1) \approx 1$, equations (8) imply the
following relationship

$$\sqrt{\mathit{ClusterCoef}_{CF,app,i}} \approx \left(1 + \mathit{Heterogeneity}_{CF,app}^2\right) \sqrt{\mathit{Density}_{CF,app}},$$

or equivalently,

$$\mathit{ClusterCoef}_{CF,app,i} \approx \left(1 + \mathit{Heterogeneity}_{CF,app}^2\right)^2 \times \mathit{Density}_{CF,app}. \tag{9}$$

Further, it is straightforward to derive a simple relationship
between approximate CF-based topological overlap,
connectivity and heterogeneity under the following mild
assumptions: $\frac{1}{S_2(CF)} \approx 0$ and
$\frac{1 - CF_i CF_j}{\min(CF_i, CF_j)\, S_1(CF)} \approx 0$.

Specifically, we find

$$
\begin{aligned}
\mathit{TopOverlap}_{CF,app,ij} &\approx \max(CF_i, CF_j)\, \frac{S_2(CF)}{S_1(CF)}
= \frac{\max\left(CF_i S_1(CF),\, CF_j S_1(CF)\right)}{n} \cdot \frac{n\, S_2(CF)}{S_1(CF)^2} \\
&= \frac{\max\left(k_{CF,app,i},\, k_{CF,app,j}\right)}{n} \left(1 + \mathit{Heterogeneity}_{CF,app}^2\right).
\end{aligned}
\tag{10}
$$

Table 2: Brief overview of different types of network concepts.

Input Matrix                                    Type of Concept         Example: Connectivity
$A - I$                                         fundamental             $\mathit{Connectivity}_i(A - I)$
$A_{CF} - I = CF\,CF^T - \mathit{diag}(CF^2)$   CF-based                $\mathit{Connectivity}_i(A_{CF} - I)$
$A_{CF,app} = CF\,CF^T$                         approximate CF-based    $\mathit{Connectivity}_i(A_{CF,app}) = CF_i \sum_j CF_j$

A network concept arises by evaluating a network concept function on a
special type of input matrix. We assume that the diagonal elements of
the matrix $A - I$ are 0.



In the following subsection, we outline the conditions 
when equations (9) and (10) hold approximately for fun- 
damental network concepts in approximately factorizable 
module networks. 
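
As a hedged numerical illustration, the Python sketch below evaluates the approximate CF-based concepts directly from a conformity vector and checks the relationships (9) and (10). The formulas coded here are the ones implied by using the matrix $CF\,CF^T$ as input. The function names and the evenly spaced conformity values are illustrative assumptions.

```python
import math

def s(p, v):
    """S_p(v) = sum of v_i^p."""
    return sum(x ** p for x in v)

def approx_cf_concepts(cf):
    """Approximate CF-based network concepts as functions of the conformity vector."""
    n = len(cf)
    s1, s2 = s(1, cf), s(2, cf)
    return {
        "connectivity": [c * s1 for c in cf],
        "density": s1 ** 2 / (n * (n - 1)),
        "centralization": n * s1 / ((n - 1) * (n - 2)) * (max(cf) - s1 / n),
        "heterogeneity": math.sqrt(max(0.0, n * s2 / s1 ** 2 - 1.0)),
        "cluster_coef": (s2 / s1) ** 2,
    }

def approx_cf_top_overlap(cf, i, j):
    """Approximate CF-based topological overlap for i != j."""
    s1, s2 = s(1, cf), s(2, cf)
    return cf[i] * cf[j] * (s2 + 1.0) / (min(cf[i], cf[j]) * s1 + 1.0 - cf[i] * cf[j])

# Hypothetical module: 50 nodes with conformities spread evenly between 0.3 and 0.9.
n = 50
cf = [0.3 + 0.6 * t / (n - 1) for t in range(n)]
c = approx_cf_concepts(cf)

# Relationship (9): right-hand side versus the clustering coefficient.
rhs_9 = (1.0 + c["heterogeneity"] ** 2) ** 2 * c["density"]
print(c["cluster_coef"], rhs_9)

# Relationship (10): largest relative error over all node pairs.
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        approx = max(c["connectivity"][i], c["connectivity"][j]) / n * (1.0 + c["heterogeneity"] ** 2)
        exact = approx_cf_top_overlap(cf, i, j)
        worst = max(worst, abs(exact - approx) / approx)
print(worst)
```

For relationship (9) the two sides differ exactly by the factor n/(n - 1), so the agreement improves as the module grows. The error in relationship (10) depends on how well the two stated assumptions hold for the chosen conformities.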

Relating fundamental- to approximate CF-based concepts 

In the Methods section, we provide a heuristic argument 
for the following 

Observation 2 In approximately factorizable networks, fundamental
network concepts are approximately equal to their
approximate CF-based analogs,

$$\mathit{FundamentalNetworkConcept} \approx \mathit{NetworkConcept}_{CF,app}.$$

The observation implies that in approximately factorizable
networks, $\mathit{Connectivity} \approx \mathit{Connectivity}_{CF,app}$ and
$\mathit{Density} \approx \mathit{Density}_{CF,app}$, etc. Observation 2 is illustrated for network
density, centralization, heterogeneity, and clustering coef- 
ficients in Figure 2 (Drosophila PPI network), Figure 3 
(yeast PPI network), and Figure 4 (weighted and 
unweighted yeast gene co-expression networks; density is 
not included due to limited space). A consequence of this 
observation is that the simple relationships satisfied by 
approximate CF-based network concepts also apply to 
their corresponding fundamental network concepts in 
approximately factorizable networks. In particular, equa- 
tions (9) and (10) imply the following 

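Observation 3 In approximately factorizable networks,
the following relationships hold among fundamental network
concepts

$$\mathit{mean}(\mathit{ClusterCoef}) \approx \left(1 + \mathit{Heterogeneity}^2\right)^2 \times \mathit{Density}, \tag{11}$$

and

$$\mathit{TopOverlap}_{ij} \approx \frac{\max(k_i, k_j)}{n} \left(1 + \mathit{Heterogeneity}^2\right). \tag{12}$$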
Observation 3 is important since it highlights the fact that
seemingly disparate network concepts satisfy simple and
intuitive relationships in approximately factorizable networks.
Equations (11) and (12) are illustrated in Figure 5
(Drosophila PPI network), Figure 6 (yeast PPI network),
and Figure 7 (weighted and unweighted yeast gene
co-expression networks; TOM plots are not included due to
limited space). Equation (12) has several important consequences.
To begin with, it illustrates that the topological
overlap between the most highly connected node and all
other nodes is approximately constant. Specifically, if we
denote the index of the most highly connected node by
[1] and its connectivity by $k_{[1]} = \max(k)$, then

$$\mathit{TopOverlap}_{[1]j} \approx \frac{\max(k)}{n} \left(1 + \mathit{Heterogeneity}^2\right). \tag{13}$$

As an aside, we briefly mention that $\mathit{TopOverlap}_{[1]j}$ has a
simple interpretation in terms of the hierarchical clustering
dendrogram that results from using $\mathit{dissTopOverlap}_{ij} =
1 - \mathit{TopOverlap}_{ij}$ as input. In this case, $\mathit{TopOverlap}_{[1]j}$ is
related to the longest branch length in the dendrogram.
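
The relationships among fundamental concepts can be checked numerically on a simulated module. The Python sketch below builds a hypothetical network whose off-diagonal adjacencies are exactly factorizable, computes the fundamental clustering coefficient, density, heterogeneity, and topological overlap from the adjacency matrix, and compares them with the right-hand sides of the relationships for the mean clustering coefficient and for the overlap with the hub. The simulated conformities and all function names are assumptions of this sketch.

```python
def connectivity(adj):
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

def cluster_coef(adj, i):
    n = len(adj)
    others = [j for j in range(n) if j != i]
    num = sum(adj[i][j] * adj[j][u] * adj[u][i]
              for j in others for u in others if u != j)
    den = sum(adj[i][j] for j in others) ** 2 - sum(adj[i][j] ** 2 for j in others)
    return num / den if den > 0 else 0.0

def top_overlap(adj, k, i, j):
    n = len(adj)
    shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
    return (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])

# Hypothetical module: 50 nodes, conformities evenly spaced between 0.3 and 0.9.
n = 50
cf = [0.3 + 0.6 * t / (n - 1) for t in range(n)]
adj = [[1.0 if i == j else cf[i] * cf[j] for j in range(n)] for i in range(n)]

k = connectivity(adj)
s1, s2 = sum(k), sum(x * x for x in k)
dens = s1 / (n * (n - 1))
het_sq = n * s2 / s1 ** 2 - 1.0  # squared heterogeneity

# Mean clustering coefficient versus (1 + Heterogeneity^2)^2 * Density.
mean_cc = sum(cluster_coef(adj, i) for i in range(n)) / n
rhs_cc = (1.0 + het_sq) ** 2 * dens

# Overlap with the hub versus max(k) / n * (1 + Heterogeneity^2).
hub = max(range(n), key=lambda i: k[i])
rhs_hub = max(k) / n * (1.0 + het_sq)
hub_overlaps = [top_overlap(adj, k, hub, j) for j in range(n) if j != hub]

print(mean_cc, rhs_cc)
print(min(hub_overlaps), max(hub_overlaps), rhs_hub)
```

In this idealized setting the overlaps with the hub fall in a narrow band around the predicted constant. Real modules are only approximately factorizable, so larger deviations should be expected there.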




Figure 2

Drosophila PPI module networks: the relationship
between fundamental network concepts $\mathit{NetworkConcept}(A - I)$
(y-axis) and their approximate CF-based analogs
$\mathit{NetworkConcept}_{CF,app}$ (x-axis). This figure
demonstrates Observation 2. A) Density versus $\mathit{Density}_{CF,app}$;
B) Centralization versus $\mathit{Centralization}_{CF,app}$; C) Heterogeneity
versus $\mathit{Heterogeneity}_{CF,app}$; D) Intramodular clustering
coefficients $\mathit{ClusterCoef}_i$ versus $\mathit{ClusterCoef}_{CF,app}$. In Figures A), B)
and C), each dot corresponds to a module since these network
concepts summarize an entire network module. In Figure D),
each dot corresponds to a node since these network
concepts are node specific. A reference line with intercept 0
and slope 1 has been added to each plot.




Figure 3

Yeast PPI module networks: the relationship
between fundamental network concepts $\mathit{NetworkConcept}(A - I)$
(y-axis) and their approximate CF-based analogs
$\mathit{NetworkConcept}_{CF,app}$ (x-axis). This figure
demonstrates Observation 2. A) Density versus $\mathit{Density}_{CF,app}$;
B) Centralization versus $\mathit{Centralization}_{CF,app}$; C) Heterogeneity
versus $\mathit{Heterogeneity}_{CF,app}$; D) Intramodular clustering
coefficients $\mathit{ClusterCoef}_i$ versus $\mathit{ClusterCoef}_{CF,app}$. In Figures A), B)
and C), each dot corresponds to a module since these network
concepts summarize an entire network module. In Figure D),
each dot corresponds to a node since these network
concepts are node specific. A reference line with intercept 0
and slope 1 has been added to each plot.



In the following, we relate $\mathit{TopOverlap}_{[1]j}$ to the fundamental
network concept Centralization. According to
equation (3), $\frac{\max(k)}{n} \approx \mathit{Centralization} + \mathit{Density}$.
Substituting this expression in equation (13) implies

$$\mathit{TopOverlap}_{[1]j} \approx \left(\mathit{Centralization} + \mathit{Density}\right) \left(1 + \mathit{Heterogeneity}^2\right). \tag{14}$$

Equation (14) is illustrated in Figure 5 (Drosophila PPI
network), Figure 6 (yeast PPI network), and Figure 7
(weighted and unweighted yeast gene co-expression networks).

In factorizable networks, fundamental network concepts 
are simple functions of the connectivity 

Here we demonstrate another advantage of approximate
CF-based network concepts. They allow one to relate
fundamental network concepts to simple functions of
the connectivity. Toward this end, note the following
simple relationship between the conformity $CF$ and the
approximate CF-based connectivity $k_{CF,app}$:

$$CF_i = \frac{k_{CF,app,i}}{\sqrt{S_1(k_{CF,app})}}. \tag{15}$$

Since in approximately factorizable networks $k_{CF,app,i} \approx k_i$, we
find that the conformity $CF$ is approximately given by the
scaled connectivity, i.e.

$$CF_i \approx \frac{k_i}{\sqrt{S_1(k)}}. \tag{16}$$

This equation shows that conformity can be interpreted as
a scaled connectivity in approximately factorizable networks.
Since approximate CF-based network concepts are
simple functions of the conformity, substituting $\frac{k}{\sqrt{S_1(k)}}$
for $CF$ implies that approximate CF-based concepts can be
approximated by simple functions of the connectivity. For
example, we find the following simple expressions for the
cluster coefficient and the topological overlap.

Observation 4

$$
\begin{aligned}
\mathit{ClusterCoef}_i &\approx \frac{\left(S_2(k)\right)^2}{\left(S_1(k)\right)^3}, \\
\mathit{TopOverlap}_{ij} &\approx \frac{1}{S_1(k)} \cdot \frac{k_i k_j \left(S_2(k) + S_1(k)\right)}{\min(k_i, k_j)\, S_1(k) + S_1(k) - k_i k_j}
\approx \frac{\max(k_i, k_j)\, S_2(k)}{S_1(k)^2},
\end{aligned}
$$

where the last approximation assumes $\frac{S_1(k)}{S_2(k)} \approx 0$ and
$\frac{S_1(k) - k_i k_j}{\min(k_i, k_j)\, S_1(k)} \approx 0$.

Figure 4

Yeast gene co-expression module networks: the relationship
between fundamental network concepts
$\mathit{NetworkConcept}(A - I)$ (y-axis) and their approximate
CF-based analogs $\mathit{NetworkConcept}_{CF,app}$ (x-axis). This
figure demonstrates Observation 2. A reference line with
intercept 0 and slope 1 has been added to each plot. The figures
on the left (right) hand side depict network concepts
from the weighted (unweighted) network. A) and B) Centralization
versus $\mathit{Centralization}_{CF,app}$; C) and D) Heterogeneity
versus $\mathit{Heterogeneity}_{CF,app}$; E) and F) Intramodular clustering
coefficients $\mathit{ClusterCoef}_i$ versus $\mathit{ClusterCoef}_{CF,app}$. The analogous
plots for Density are omitted due to limited space: the fundamental
network concepts and their approximate CF-based analogs
are almost identical, and the dots fall near the reference line
with $R^2 = 1$ for both weighted and unweighted networks.
In Figures A), B), C) and D), each dot corresponds to a module
since these network concepts summarize an entire network module.
In Figures E) and F), each dot corresponds to a node since these
network concepts are node specific.

Figure 5

Drosophila PPI module networks: the relationship
between fundamental network concepts. This figure
demonstrates Observation 3 and equation (14). In Figures A)
and B), each point is a protein colored by its module assignment,
and the red line has intercept 0 and slope 1. Figure A)
illustrates the relationship between the mean clustering coefficient
(short horizontal line) and $(1 + \mathit{Heterogeneity}^2)^2 \times \mathit{Density}$
(equation (11)). Figure B) illustrates the relationship
between the topological overlap with the hub node and
$(\mathit{Density} + \mathit{Centralization}) \times (1 + \mathit{Heterogeneity}^2)$ (equation (14)).
Figure C) is a color-coded depiction of the topological overlap
matrix $\mathit{TopOverlap}_{ij}$ in the turquoise module network.
Figure D) represents the corresponding approximation
$\max(k_i, k_j)(1 + \mathit{Heterogeneity}^2)/n$ (equation (12)). Figures E) and
F) are their analogs for the brown module. The turquoise
and the brown module represent the largest and third largest
module. Analogous plots for the other modules can be found
in our online supplement.

Figure 6

Yeast PPI module networks: the relationship
between fundamental network concepts. This figure
demonstrates Observation 3 and equation (14). In Figures A)
and B), each point is a protein colored by its module assignment,
and the red line has intercept 0 and slope 1. Figure A)
illustrates the relationship between the mean clustering coefficient
(short horizontal line) and $(1 + \mathit{Heterogeneity}^2)^2 \times \mathit{Density}$
(equation (11)). Figure B) illustrates the relationship
between the topological overlap with the hub node and
$(\mathit{Density} + \mathit{Centralization}) \times (1 + \mathit{Heterogeneity}^2)$ (equation (14)).
Figure C) is a color-coded depiction of the topological overlap
matrix $\mathit{TopOverlap}_{ij}$ in the turquoise module network.
Figure D) represents the corresponding approximation
$\max(k_i, k_j)(1 + \mathit{Heterogeneity}^2)/n$ (equation (12)). Figures E) and
F) are their analogs for the brown module. The turquoise
and the brown module represent the largest and third largest
module. Analogous plots for the other modules can be found
in our online supplement.
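
The connectivity-based expressions can be sketched in Python as follows. The code assumes that conformity is approximated by the connectivity divided by the square root of the summed connectivity, that the clustering coefficient is approximated by the squared sum of squared connectivities divided by the cubed sum of connectivities, and that the topological overlap is approximated by the larger connectivity times the sum of squared connectivities divided by the squared sum of connectivities. The function names and test values are illustrative assumptions.

```python
import math

def conformity_from_connectivity(k):
    """Scaled connectivity: CF_i is approximated by k_i / sqrt(S_1(k))."""
    s1 = sum(k)
    return [x / math.sqrt(s1) for x in k]

def cluster_coef_from_connectivity(k):
    """ClusterCoef is approximated by S_2(k)^2 / S_1(k)^3."""
    s1 = sum(k)
    s2 = sum(x * x for x in k)
    return s2 ** 2 / s1 ** 3

def top_overlap_from_connectivity(k, i, j):
    """TopOverlap_ij is approximated by max(k_i, k_j) * S_2(k) / S_1(k)^2."""
    s1 = sum(k)
    s2 = sum(x * x for x in k)
    return max(k[i], k[j]) * s2 / s1 ** 2

# Consistency check on hypothetical conformities: use the approximate CF-based
# connectivity k_i = CF_i * S_1(CF) as input.
n = 50
cf = [0.3 + 0.6 * t / (n - 1) for t in range(n)]
s1_cf = sum(cf)
s2_cf = sum(c * c for c in cf)
k = [c * s1_cf for c in cf]

print(conformity_from_connectivity(k)[:3], cf[:3])
print(cluster_coef_from_connectivity(k), (s2_cf / s1_cf) ** 2)
```

When the input is the approximate CF-based connectivity, the scaled connectivity reproduces the conformity and the connectivity-based clustering coefficient reproduces its approximate CF-based counterpart, up to rounding.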



Protein-protein interaction and gene co-expression 
network applications 

Drosophila and yeast protein-protein network 
To illustrate our results, we computed network concepts 
in module networks based on Drosophila and yeast pro- 
tein-protein interaction (PPI) networks downloaded from 
BioGrid [31]. As described before, we defined the modules
as branches of the hierarchical clustering dendrogram,
see Figure 1.

Of the 1371 proteins in the Drosophila PPI network, 862 
were clustered into 28 modules, and the remaining pro- 
teins grouped into an improper (grey) module. The mod- 
ule sizes of the proper modules range from 10 to 96, mean 
30.79, median 23, and interquartile range 24. 





Figure 7

Yeast gene co-expression module networks: the relationship
between fundamental network concepts.
This figure demonstrates Observation 3 and equation (14).
The figures on the left (right) hand side depict network concepts
from the weighted (unweighted) network. Each point is
a gene colored by its module assignment. The red line has
intercept 0 and slope 1. Figures A) and B) illustrate the relationship
between the mean clustering coefficient (short horizontal
line) and $(1 + \mathit{Heterogeneity}^2)^2 \times \mathit{Density}$ (equation (11)).
Figures C) and D) illustrate the relationship between the topological
overlap with the hub node and
$(\mathit{Density} + \mathit{Centralization}) \times (1 + \mathit{Heterogeneity}^2)$ (equation (14)).



Of the 2292 proteins in the yeast PPI network, 2050 were 
clustered into 44 proper modules, and the remaining pro- 
teins grouped into an improper module. The module sizes 
of the proper modules range from 10 to 219, mean 46.59, 
median 24, and interquartile range 38.8. 

Yeast gene co-expression networks 

We now illustrate our theoretical results using gene co- 
expression networks that have been used by many 
authors, e.g. [11,21,32]. Gene co-expression networks are 
constructed on the basis of microarray data from the tran- 
scriptional response of cells to changing conditions. There 
is evidence that genes with similar expression profiles are 
more likely to encode interacting proteins [33,34]. 

In gene co-expression networks, nodes correspond to gene 
expression profiles. The corresponding adjacency matrix 
is determined from a measure of co-expression between 
the genes. In the examples below, we will use the absolute 
value of the Pearson correlation coefficient between the 
gene expression profiles to measure co-expression 
between gene pairs. As detailed at the end of the Methods 
section, one can transform the Pearson correlation matrix 
into an adjacency matrix by hard thresholding or soft 
thresholding. Hard thresholding results in an unweighted
network and soft thresholding results in a weighted network
[21]. We applied our methods to a yeast cell cycle
microarray data set comprising 44 microarrays and 2001
genes. This data set recorded gene expression levels during
different stages of the yeast cell cycle and has been widely
used before to illustrate clustering methods [35].

Of the 2001 genes (microarray probesets) in the weighted 
yeast gene co-expression network, 1081 were clustered 
into 8 proper modules. The module sizes of the proper 
modules range from 53 to 308, mean 135.1, median 
101.5, and interquartile range 69.3. To facilitate a com- 
parison between the weighted and the unweighted gene 
co-expression networks, we used the module assignment 
of the weighted network for the unweighted network as 
well. It turns out that the module assignment is highly 
preserved between the weighted and the unweighted gene 
co-expression networks, see Figures 1C) and 1D).

Functional annotation of modules 

Since the scope of this paper is a mathematical and topo- 
logical analysis of module networks, we defined modules 
without regard to external gene ontology information. 
Also we do not provide an in-depth analysis of the biolog- 
ical meaning of the network modules. But we briefly men- 
tion that there is indirect evidence that most of the 
resulting modules are biologically meaningful. We used 
the functional gene annotation tools from the Database 
for Annotation, Visualization and Integrated Discovery 
(DAVID) [36] to test for both enriched biochemical path- 
ways and subcellular compartmentalization. We find that 
most modules are significantly enriched with known gene 
ontologies. A functional enrichment analysis for each net- 
work application is provided. For the Drosophila PPI net- 
work, [see Additional file 3]; for the yeast PPI network, 
[see Additional file 4]; for the weighted and unweighted 
yeast gene co-expression networks, [see Additional file 5]. 

Empirical relationships in 4 different networks 
In accordance with Observation 2, we find a close relationship ($R^2 > 0.6$) between the fundamental network concepts and their approximate CF-based analogs. Specifically, we relate the network density, centralization, heterogeneity and clustering coefficients to their approximate CF-based analogs in Figure 2 (Drosophila PPI network), Figure 3 (yeast PPI network), and Figure 4 (weighted and unweighted yeast gene co-expression networks).

In accordance with Observation 3, we find a close relationship ($R^2 > 0.6$) between the mean clustering coefficient $mean(ClusterCoef)$ and $(1 + Heterogeneity^2)^2 \times Density$. Further, we find a close relationship between the topological overlap with the hub node and $(Centralization + Density)(1 + Heterogeneity^2)$, see Figure 5 (Drosophila PPI network), Figure 6 (yeast PPI network), and Figure 7 (weighted and unweighted yeast gene co-expression networks). We find that our theoretical observations fit better in the weighted than in the unweighted yeast gene co-expression network.

Network concepts and module size 

Since the number of genes inside a module (module size) 
varies greatly among the modules, it is natural to wonder 
whether the reported relationships between network con- 
cepts are due to the underlying module sizes. We find that 
the relationship between fundamental network concepts 
and their approximate CF-based analogs remains highly 
significant even after correcting for module sizes [see 
Additional file 2]. The same holds for the relationships 
between network concepts. Thus, none of the reported 
relationships is trivially due to module sizes. But we find 
that many network concepts depend on the underlying 
module size. We find that large modules are less factoriz- 
able than small modules: there is a strong negative corre- 
lation between module factorizability F(A) and module 
size. We also find that fundamental network concepts 
(e.g. density) depend on module size in our applications. 
For the factorizability, density, centralization, heterogeneity and mean clustering coefficient, the correlation coefficients with module size are -0.84, -0.46, -0.17, 0.26, and -0.36 in Drosophila PPI module networks; -0.55, -0.52, 0.05, 0.5, and -0.44 in yeast PPI module networks; -0.93, -0.52, -0.82, 0.27, and -0.55 in weighted yeast gene co-expression module networks; and -0.86, -0.77, -0.56, 0.87, and -0.85 in unweighted yeast gene co-expression module networks. A more detailed analysis is presented in the Additional files [see Additional file 2].

A simple exactly factorizable network example: constant 
network 

A simple, exactly factorizable network is given by an adjacency matrix A with constant adjacencies ($a_{ij} = b$, $b \in (0, 1]$). The adjacency matrix is exactly factorizable since $a_{ij} = CF_i CF_j$ where $CF_i = \sqrt{b}$. This network can be interpreted as the expected adjacency matrix of an Erdős-Rényi network [37]. One can easily derive the following expressions for the fundamental network concepts: $Connectivity_i = (n-1)b$, $Density = b$, $Centralization = 0$, $Heterogeneity = 0$, $ClusterCoef_i = b$ and $TopOverlap_{ij} = b$.
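The closed-form values for the constant network can be checked by brute force. The following is a minimal sketch, not the authors' R implementation: the function names and the example values n = 12 and b = 0.3 are assumptions chosen for illustration, and the formulas follow the fundamental definitions of connectivity, clustering coefficient and topological overlap given in the Methods section.

```python
def constant_network(n, b):
    """Adjacency matrix with a_ij = b for i != j and a_ii = 1."""
    return [[1.0 if i == j else b for j in range(n)] for i in range(n)]


def connectivity(A, i):
    """k_i: sum of the adjacencies between node i and all other nodes."""
    return sum(A[i][j] for j in range(len(A)) if j != i)


def cluster_coef(A, i):
    """Weighted clustering coefficient of node i."""
    others = [j for j in range(len(A)) if j != i]
    num = sum(A[i][j] * A[j][l] * A[l][i]
              for j in others for l in others if l != j)
    den = (sum(A[i][j] for j in others) ** 2
           - sum(A[i][j] ** 2 for j in others))
    return num / den


def top_overlap(A, i, j):
    """Topological overlap between two distinct nodes i and j."""
    shared = sum(A[i][u] * A[u][j] for u in range(len(A)) if u not in (i, j))
    k_min = min(connectivity(A, i), connectivity(A, j))
    return (shared + A[i][j]) / (k_min + 1 - A[i][j])


n, b = 12, 0.3  # hypothetical network size and constant adjacency
A = constant_network(n, b)
k0 = connectivity(A, 0)      # expected (n - 1) * b
cc0 = cluster_coef(A, 0)     # expected b
to01 = top_overlap(A, 0, 1)  # expected b
print(k0, cc0, to01)
```

Because every node is equivalent in this network, checking a single node and a single pair is sufficient.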

Since A is exactly factorizable, the fundamental network concepts equal their CF-based analogs. However, the approximate CF-based concepts are different from their exact counterparts, see Table 3. For reasonably large values of n, the fundamental network concepts are very close to their approximate CF-based analogs. This illustrates Observation 2. With the results in Table 3, one can easily verify Observation 3 and equation (16) in this example.

Example: block diagonal adjacency matrix

In the following, we will consider a block diagonal adjacency matrix where each block has constant adjacencies, i.e.

$$A = \begin{pmatrix} A^{(1)} & 0 \\ 0 & A^{(2)} \end{pmatrix}, \tag{17}$$

where the off-diagonal entries of block $A^{(k)}$ all equal $b_k$ ($k = 1, 2$) and the diagonal entries equal 1.

We assume that the first and second blocks have dimensions $n_1 \times n_1$ and $n_2 \times n_2$, respectively. Such a block diagonal matrix can be interpreted as a network with two distinct modules. Setting $n_2 = 0$ results in the simple constant adjacency matrix, which we considered before.

We denote by $I_1 = (1, 1, \ldots, 1, 0, 0, \ldots, 0)$ the vector whose first $n_1$ components equal 1 and the remaining components equal 0. Similarly, we define $I_2 = (0, 0, \ldots, 0, 1, 1, \ldots, 1) = 1 - I_1$. To simplify the calculation of the conformity, we further assume that

$$\frac{n_2(n_2 - 1)b_2^2}{n_1(n_1 - 1)b_1^2} < 1. \tag{18}$$

Then the conformity is uniquely defined by $CF = \sqrt{b_1}\, I_1$, as one can show using equations (36) and (37) in the appendix. Further, using Proposition 10 in the appendix, one can show that the factorizability is given by

$$F(A) = \frac{n_1(n_1 - 1)b_1^2}{n_1(n_1 - 1)b_1^2 + n_2(n_2 - 1)b_2^2}. \tag{19}$$

Table 3: Network concepts in the constant Erdős-Rényi network.

| Network concept | Fundamental | Approximate CF-based |
|---|---|---|
| $Connectivity_i$ | $(n-1)b$ | $nb$ |
| $Density$ | $b$ | $b\,\frac{n}{n-1}$ |
| $Centralization$ | $0$ | $0$ |
| $Heterogeneity$ | $0$ | $0$ |
| $TopOverlap_{ij}$ | $b$ | $b\,\frac{nb+1}{(n-1)b+1}$ |
| $ClusterCoef_i$ | $b$ | $b$ |
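Table 4: Network concepts in the simulated block-diagonal network.

| Concept | Fundamental | CF-based | Approx CF-based |
|---|---|---|---|
| $Connectivity_i$ | $(n_1-1)b_1\,Ind(i \le n_1) + (n_2-1)b_2\,Ind(i > n_1)$ | $(n_1-1)b_1\,Ind(i \le n_1)$ | $n_1 b_1\,Ind(i \le n_1)$ |
| $Density$ | $\frac{n_1(n_1-1)b_1 + n_2(n_2-1)b_2}{(n_1+n_2)(n_1+n_2-1)}$ | $\frac{n_1(n_1-1)b_1}{(n_1+n_2)(n_1+n_2-1)}$ | $\frac{n_1^2 b_1}{(n_1+n_2)(n_1+n_2-1)}$ |
| $Centralization$ | $\frac{n_2((n_1-1)b_1 - (n_2-1)b_2)}{(n_1+n_2-1)(n_1+n_2-2)}$ | $\frac{(n_1-1)n_2 b_1}{(n_1+n_2-1)(n_1+n_2-2)}$ | $\frac{n_1 n_2 b_1}{(n_1+n_2-1)(n_1+n_2-2)}$ |
| $Heterogeneity$ | $\sqrt{\frac{(n_1+n_2)[n_1(n_1-1)^2 b_1^2 + n_2(n_2-1)^2 b_2^2]}{[n_1(n_1-1)b_1 + n_2(n_2-1)b_2]^2} - 1}$ | $\sqrt{n_2/n_1}$ | $\sqrt{n_2/n_1}$ |
| $TopOverlap_{ij}$ | $b_1\,Ind(i,j \le n_1) + b_2\,Ind(i,j > n_1)$ | $b_1\,Ind(i,j \le n_1)$ | $b_1\,\frac{n_1 b_1 + 1}{(n_1-1)b_1 + 1}\,Ind(i,j \le n_1)$ |
| $ClusterCoef_i$ | $b_1\,Ind(i \le n_1) + b_2\,Ind(i > n_1)$ | $b_1\,Ind(i \le n_1)$ | $b_1\,Ind(i \le n_1)$ |

The indicator function $Ind(\cdot)$ takes on the value 1 if the condition is satisfied and 0 otherwise. The fundamental centralization is written for the case $(n_1-1)b_1 \ge (n_2-1)b_2$, in which the maximum connectivity is attained in the first block.

In particular, if $n_1 \approx n_2$ and $b_1 \approx b_2$, i.e. if the adjacency matrix is comprised of two nearly identical blocks, the factorizability is $F(A) \approx 1/2$. Similarly, one can show that if the matrix A is comprised of B identical blocks, then $F(A) \approx 1/B$.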
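The behaviour of the factorizability in the two-block example can be explored numerically. The sketch below assumes that equation (19) reads F(A) = n1(n1-1)b1^2 / (n1(n1-1)b1^2 + n2(n2-1)b2^2) and that condition (18) requires the second block to carry less weight than the first. The function name and the example block sizes are hypothetical.

```python
def block_factorizability(n1, b1, n2, b2):
    """Factorizability of a two-block network, following the reading of
    equation (19) stated in the lead-in; condition (18) must hold."""
    w1 = n1 * (n1 - 1) * b1 ** 2  # squared off-diagonal weight of block 1
    w2 = n2 * (n2 - 1) * b2 ** 2  # squared off-diagonal weight of block 2
    if not w2 < w1:
        raise ValueError("condition (18) requires n2(n2-1)b2^2 < n1(n1-1)b1^2")
    return w1 / (w1 + w2)


# Two nearly identical blocks: the factorizability is close to 1/2.
f_similar = block_factorizability(50, 0.5, 50, 0.49)
# A weak second block: the network is approximately factorizable.
f_weak = block_factorizability(50, 0.5, 50, 0.05)
print(f_similar, f_weak)
```

The second call mimics the limit in which the second block's adjacency tends to 0 while the block sizes and the first block's adjacency stay fixed.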

This block diagonal network allows one to arrive at 
explicit formulas for fundamental-, CF-based-, and 
approximate CF-based network concepts, see Table 4. 

In the following, we study the relationship between fundamental network concepts and their approximate CF-based analogs in the limit when the block diagonal network becomes approximately factorizable. Specifically, we calculate network concepts in the limit $b_2 \to 0$ when $n_1$, $n_2$ and $b_1$ are kept fixed. Under this assumption, $b_2 \to 0$ is equivalent to $F(A) \to 1$. Then, one can easily show that

$$\begin{aligned}
\lim_{F(A) \to 1} Connectivity_i &= \frac{n_1 - 1}{n_1}\, Connectivity_{CF,app,i}, \\
\lim_{F(A) \to 1} Density &= \frac{n_1 - 1}{n_1}\, Density_{CF,app}, \\
\lim_{F(A) \to 1} Centralization &= \frac{n_1 - 1}{n_1}\, Centralization_{CF,app}, \\
\lim_{F(A) \to 1} Heterogeneity &= Heterogeneity_{CF,app}, \\
\lim_{F(A) \to 1} TopOverlap_{ij} &= \frac{(n_1 - 1)b_1 + 1}{n_1 b_1 + 1}\, TopOverlap_{CF,app,ij}, \\
\lim_{F(A) \to 1} ClusterCoef_i &= ClusterCoef_{CF,app,i}.
\end{aligned}$$

For reasonably large values of $n_1$ (say $n_1 > 20$), these limits illustrate Observation 2. Similarly, one can easily verify Observation 3 and equation (16) in the case when the factorizability F(A) is close to 1 and $n_1$ is reasonably large.

Discussion 

This paper does not describe new software or a new method for
constructing networks. Instead, it presents theoretical 
results which clarify the mathematical relationship 
between network concepts in module networks. A deeper 
understanding of network concepts may guide the data 
analyst on how to construct and use networks in practice. 
Our results will pertain to any network that is approxi- 
mately factorizable irrespective of its construction 
method. While the term 'factorizable' network is new, 
numerous examples of these types of networks can be 
found in the literature, e.g. [38]. A recent physical model 
for experimentally determined protein-protein interactions is exactly factorizable [39]. In that model, the 'affinity' $a_{ij}$ between proteins i and j is the product of the corresponding conformities. The conformities are approximately given by $CF_i = \exp(-K_i)$ where $K_i$ is the number of hydrophobic residues in the i-th protein. Another related example is an exactly factorizable random network model for which the edges between pairs of nodes are drawn according to a linking probability function [40,41].

We find that in many applications, the conformity is 
highly related to the first eigenvector of the adjacency 
matrix. The idea of using a variant of the singular value 
decomposition for decomposing an adjacency matrix has 
been proposed by several authors [42-45]. However, we
prefer to define the conformity as a maximizer of the factorizability function

$$F_A(v) = 1 - \frac{\|(A - I) - (v v^T - diag(v^2))\|_F^2}{\|A - I\|_F^2}$$

for the following reasons: First, the factorizability satisfies
$F_A(CF) = 1$ if, and only if, A is an exactly factorizable network
with $a_{ij} = CF_i CF_j$. Second, we prefer to define the conformity
without reference to the diagonal elements $a_{ii}$ of
the adjacency matrix. Third, the definition naturally fits
within the framework of least squares factor analysis 
where conformity can be interpreted as the first factor 
[46]. An algorithm for computing the conformity in gen- 
eral networks is presented in the appendix. While network 
analysis focuses on the adjacency matrix, factor analysis 
takes as input a correlation or covariance matrix. In mod- 
ule networks, the first factor (conformity) corresponds to 
a normalized connectivity measure, see equation (16). 
Future research could explore the network interpretation 
of higher order factors. 

The topological structure of complex networks has been 
the focus of numerous studies, e.g. [7,8,16-18,20,38,47]. 
Here we explore the structure of special types of networks,
which we refer to as module networks. 

To derive results for factorizable module networks, we
define several novel terms including a measure of network
factorizability F(A), conformity, CF-based network concepts,
and approximate CF-based network concepts.

The first result (Observation 1) uses both PPI and gene co- 
expression network data to show empirically that subnet- 
works comprised of module nodes are often approxi- 
mately factorizable. This insight could be interesting to 
researchers who develop module detection methods. 
Approximate factorizability is a very stringent structural 
assumption that is not satisfied in general networks. 
While modules in gene co-expression networks tend to be 
approximately factorizable if the corresponding expres- 
sion profiles are highly correlated, the situation is more 
complicated for modules in PPI networks: only after
replacing the original adjacency matrix by a 'smoothed
out' version (the topological overlap matrix), do we find
that the resulting modules are approximately factorizable.

The second result (Observation 2) shows that fundamen- 
tal network concepts are approximately equal to their 
approximate CF-based analogs in approximately factorizable
networks (e.g. modules). While fundamental network
concepts are defined with respect to the adjacency
matrix, approximate CF-based network concepts are 
defined with respect to the conformity vector. The close 
relationship between fundamental and approximate CF- 
based concepts in module networks can be used to pro- 
vide an intuitive interpretation of network concepts in 
modules. We demonstrate that these high correlations 
between module concepts remain significant even after 
adjusting the analysis for differences in module size [see 
Additional file 2]. 

The third result (Observation 3) shows that the mean 
clustering coefficient is determined by the density and the 
network heterogeneity in approximately factorizable net- 
works. Further, the topological overlap between two 
nodes is determined by the maximum of their respective 
connectivities and the heterogeneity. Thus, seemingly disparate
network concepts satisfy simple and intuitive relationships
in these special but biologically important types
of networks.

The fourth result (Observation 4) is that in approximately 
factorizable networks, fundamental network concepts can 
be expressed as simple functions of the connectivity. 
Under mild assumptions, we argue that the clustering 
coefficient and the topological overlap matrix can be 
approximated by simple functions of the connectivity. 

Our empirical data also highlight how network concepts 
differ between subnetworks of 'proper' modules and the 
subnetwork comprised of improper (grey) module nodes, 
see Table 1. For all applications, we find that proper modules
have high factorizability, high density, and high mean
conformity. Based on our theoretical derivations, it comes
as no surprise that proper modules also have a high aver- 
age clustering coefficient and a high centralization when 
compared to the improper module. But we find no differ- 
ence in heterogeneity between proper and improper mod- 
ule networks. 

As a consequence of approximate factorizability, network 
concepts with disparate meanings in social network the- 
ory are closely related in module networks. Our results 
shed some light on the relationship between network con- 
cepts traditionally used by social scientists (e.g. centraliza- 
tion, heterogeneity) and concepts used by systems 
biologists (e.g. topological overlap). For example, equa- 
tion (13) shows that in module networks, the topological 



overlap between a hub gene and other module genes is 
related to the centralization. 

Conclusion 

Using several protein-protein interaction and gene co- 
expression networks, we provide empirical evidence that 
subnetworks comprised of module nodes often satisfy an 
important structural property, which we call 'approximate 
factorizability'. In these types of networks, simple rela- 
tionships exist between seemingly disparate network con- 
cepts. Several network concepts with very different 
meanings in general networks turn out to be highly correlated
across modules. These results are pertinent for systems
biology since a biological pathway may correspond
to an approximately factorizable module network.

Methods 

The adjacency matrix and notation 

We study the properties of an adjacency matrix (network) 
A that satisfies the following three conditions: 

(A.1) A is symmetric and has dimension n × n.

(A.2) The entries of A are bounded within [0, 1], that is, $0 \le a_{ij} \le 1$ for all $1 \le i, j \le n$.

(A.3) The diagonal elements of A are all 1, that is, $a_{ii} = 1$ for all $1 \le i \le n$.

Matrix and vector notation 

We will make use of the following notations. We denote by $e_i$ the unit vector whose i-th entry equals 1 and by $\mathbf{1}$ the 'one' vector whose components all equal 1. The Frobenius matrix norm is denoted by $\|M\|_F = \sqrt{\sum_i \sum_j m_{ij}^2}$. The transpose of a matrix or vector is denoted by the superscript $T$. For any real number p, we use the notation $M^p$ and $v^p$ to denote the element-wise power of a matrix M and a vector v respectively. We define the function $S_p(\cdot)$ for a vector v as $S_p(v) = \sum_i v_i^p = (v^p)^T \mathbf{1}$. Further denote by I the identity matrix and by $diag(v^2)$ a diagonal matrix with its i-th diagonal component given by $v_i^2$, $i = 1, \ldots, n$. We define the maximum function $\max(M)$ as the maximum entry of the matrix M and $\max(v)$ as the maximum entry of the vector v. Similarly we define the minimum function $\min(\cdot)$. Also, we define $mean(v) = S_1(v)/n$ and $variance(v) = S_2(v)/n - (S_1(v)/n)^2$.

Uniqueness of the conformity for an exactly factorizable network

One can easily show that the vector CF is not unique if an exactly factorizable network contains only n = 2 nodes. However, for n > 2 the conformity is uniquely defined when dealing with a weighted network where $a_{ij} > 0$.

Specifically, we prove the following statement. If A is an n × n (n ≥ 3) dimensional adjacency matrix with positive entries ($a_{ij} > 0$), then the system of equations in (7) has at most one solution CF with positive entries. If the solution exists, it is given by

$$CF_i = \left( \frac{p_i}{\left( \prod_{m=1}^{n} p_m \right)^{1/(2(n-1))}} \right)^{\frac{1}{n-2}}, \tag{20}$$

where $p_i = \prod_{j \ne i} a_{ij}$ denotes the 'product connectivity' of the i-th node.
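Equation (20) translates directly into a few lines of code. The following sketch assumes the reading CF_i = (p_i / (prod_m p_m)^(1/(2(n-1))))^(1/(n-2)) with product connectivity p_i = prod_{j != i} a_ij. The function name and the example conformity vector are hypothetical. For large n the raw products can underflow, so a practical implementation would probably work with sums of logarithms instead.

```python
import math


def conformity_from_products(A):
    """Conformity of a weighted, exactly factorizable network with positive
    entries, computed from the product connectivities (n >= 3 required)."""
    n = len(A)
    if n < 3:
        raise ValueError("the conformity is only unique for n >= 3")
    # Product connectivity p_i = prod_{j != i} a_ij.
    p = [math.prod(A[i][j] for j in range(n) if j != i) for i in range(n)]
    # (prod_m p_m)^(1 / (2(n-1))) equals the product of all conformities.
    total = math.prod(p) ** (1.0 / (2 * (n - 1)))
    return [(p_i / total) ** (1.0 / (n - 2)) for p_i in p]


# Build an exactly factorizable network from a known conformity vector
# and check that the formula recovers it.
cf_true = [0.9, 0.7, 0.5, 0.3]
A = [[1.0 if i == j else cf_true[i] * cf_true[j] for j in range(4)]
     for i in range(4)]
cf_recovered = conformity_from_products(A)
print(cf_recovered)
```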

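Proof: by assumption, we have $a_{ij} = CF_i CF_j$ for a positive vector CF and n ≥ 3. Multiplying both sides of equation (7) over all pairs $i \ne j$ yields

$$\prod_{i} \prod_{j \ne i} a_{ij} = \left( \prod_{i=1}^{n} CF_i \right)^{2(n-1)}.$$

Since $\prod_{i=1}^{n} CF_i$ is positive, we find $\prod_{i=1}^{n} CF_i = \left( \prod_{m} \prod_{l \ne m} a_{ml} \right)^{1/(2(n-1))} = \left( \prod_{m=1}^{n} p_m \right)^{1/(2(n-1))}$. Similarly, eliminating the i-th row and column from A yields

$$\prod_{j \ne i} CF_j = \left( \prod_{m \ne i} \prod_{l \ne m, i} a_{ml} \right)^{1/(2(n-2))} = \left( \frac{\prod_{m=1}^{n} p_m}{p_i^2} \right)^{1/(2(n-2))}.$$

Since $CF_i = \prod_{j=1}^{n} CF_j / \prod_{j \ne i} CF_j$, we conclude that $CF_i$ is uniquely defined by

$$CF_i = \left( \frac{p_i}{\left( \prod_{m=1}^{n} p_m \right)^{1/(2(n-1))}} \right)^{\frac{1}{n-2}}.$$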



Network concept functions and fundamental network concepts

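In general, we define a network concept function to be a tensor valued function (e.g. the connectivity vector) that takes a square matrix (e.g. the network adjacency matrix) as input.

Denote by $M = [m_{ij}]$ a general n × n matrix. Then we will study the following network concept functions:

$$\begin{aligned}
Connectivity_i(M) &= \sum_{j} m_{ij} = e_i^T M \mathbf{1}, \\
Density(M) &= \frac{\sum_i \sum_j m_{ij}}{n(n-1)} = \frac{\mathbf{1}^T M \mathbf{1}}{n(n-1)}, \\
Centralization(M) &= \frac{n}{n-2} \left( \frac{\max(M \mathbf{1})}{n-1} - Density(M) \right), \\
Heterogeneity(M) &= \sqrt{\frac{n\,(\mathbf{1}^T M M \mathbf{1})}{(\mathbf{1}^T M \mathbf{1})^2} - 1}, \\
TopOverlap_{ij}(M) &= \frac{e_i^T M M e_j + e_i^T M e_j}{\min\{e_i^T M \mathbf{1},\, e_j^T M \mathbf{1}\} + 1 - e_i^T M e_j}, \\
ClusterCoef_i(M) &= \frac{e_i^T M M M e_i}{e_i^T M B_M M e_i},
\end{aligned} \tag{21}$$

where the components of the matrix $B_M$ in the denominator of the clustering coefficient function are given by $b_{ij} = 1$ if $i \ne j$ and $b_{ii} = Ind(m_{ii} > 0)$. Here the indicator function $Ind(\cdot)$ takes on the value 1 if the condition is satisfied and 0 otherwise. In equation (21), juxtaposed matrices such as $M M$ denote ordinary matrix products.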

For the sake of brevity, we study only a limited selection 
of network concept functions and do not claim that these 
are more important than others studied in the literature. 
Our general formalism for relating fundamental network 
concepts to their approximate CF-based analogs should 
allow the reader to adapt our derivations to alternative 
concepts as well. 

Now we are ready to define the fundamental network con- 
cepts that are studied in this article. 

Definition 5 (Fundamental Network Concept) The fundamental network concepts of a network A are defined by evaluating the network functions (equation (21)) on A - I, i.e.

FundamentalNetworkConcept = NetworkConcept(A - I).

As special cases of this definition, we find the following concepts. The connectivity (also known as degree) of the i-th node is given by

$$k_i = Connectivity_i(A - I) = \sum_{j \ne i} a_{ij}.$$

The line density [13] equals the mean adjacency, i.e.

$$Density(A - I) = \frac{\sum_i \sum_{j \ne i} a_{ij}}{n(n-1)} = \frac{S_1(k)}{n(n-1)} = \frac{mean(k)}{n-1}. \tag{22}$$

For notational convenience, we sometimes omit the reference to the adjacency matrix and simply write Density to denote the fundamental network concept.

The normalized connectivity centralization (also known as degree centralization) [14] is given by

$$Centralization(A - I) = \frac{n}{n-2} \left( \frac{\max(k)}{n-1} - Density \right) = \frac{n}{(n-2)(n-1)} \left( \max(k) - mean(k) \right). \tag{23}$$

Our definition of the network heterogeneity equals the coefficient of variation of the connectivity distribution, i.e.

$$Heterogeneity(A - I) = \frac{\sqrt{variance(k)}}{mean(k)} = \sqrt{\frac{n\, S_2(k)}{S_1(k)^2} - 1}. \tag{24}$$

Note that $Heterogeneity(b \cdot M) = Heterogeneity(M)$ for a scalar $b \ne 0$.

The clustering coefficient of node i is a density measure of local connections, or 'cliquishness' [19,20]. Specifically,

$$ClusterCoef_i = ClusterCoef_i(A - I) = \frac{\sum_{j \ne i} \sum_{l \ne i,j} a_{ij} a_{jl} a_{li}}{\left( \sum_{j \ne i} a_{ij} \right)^2 - \sum_{j \ne i} a_{ij}^2}. \tag{25}$$

The topological overlap between nodes i and j reflects their relative interconnectedness. It is defined by

$$TopOverlap_{ij} = TopOverlap_{ij}(A - I) = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}, \tag{26}$$

where $l_{ij} = \sum_{u \ne i,j} a_{iu} a_{uj}$.


Network concepts in exactly factorizable networks 

In the following, we will present explicit formulas for the fundamental network concepts in Definition 5 when the adjacency matrix A is exactly factorizable, i.e. if $a_{ij} = CF_i CF_j$. We define the CF-based adjacency matrix as follows

$$A_{CF} := CF\, CF^T - diag(CF^2) + I, \tag{27}$$

where $diag(CF^2)$ denotes the diagonal matrix with diagonal elements $CF_i^2$, $i = 1, \ldots, n$. Then one can easily show that for exactly factorizable networks

$$NetworkConcept(A - I) = NetworkConcept(A_{CF} - I). \tag{28}$$

Using our definition of network concept functions in equation (21), one can easily derive the following formulas for $NetworkConcept(A_{CF} - I)$ in terms of the quantities $S_p(CF) = \sum_i CF_i^p$:

$$\begin{aligned}
Connectivity_i(A_{CF} - I) &= CF_i\, S_1(CF) - CF_i^2, \\
Density(A_{CF} - I) &= \frac{S_1(CF)^2 - S_2(CF)}{n(n-1)}, \\
Centralization(A_{CF} - I) &= \frac{n}{n-2} \left( \frac{\max(Connectivity(A_{CF} - I))}{n-1} - Density(A_{CF} - I) \right), \\
ClusterCoef_i(A_{CF} - I) &= \frac{(S_2(CF) - CF_i^2)^2 - (S_4(CF) - CF_i^4)}{(S_1(CF) - CF_i)^2 - (S_2(CF) - CF_i^2)}, \\
TopOverlap_{ij}(A_{CF} - I) &= \frac{CF_i CF_j \left( S_2(CF) - CF_i^2 - CF_j^2 \right) + CF_i CF_j}{\min\{CF_i (S_1(CF) - CF_i),\, CF_j (S_1(CF) - CF_j)\} + 1 - CF_i CF_j}.
\end{aligned} \tag{29}$$

Approximate CF-based network concepts in general networks

When $A_{CF} - I$ is used as input of a network concept function, it gives rise to a CF-based network concept as detailed in the following

Definition 6 (CF-based Network Concepts) Assume that the conformity vector CF can be defined for a general adjacency matrix A. Then the CF-based network concepts are defined by evaluating the network concept functions on $A_{CF} - I = CF\, CF^T - diag(CF^2)$, i.e.

$$NetworkConcept_{CF} := NetworkConcept(A_{CF} - I).$$

By definition, fundamental network concepts are equal to their CF-based analogs if A is exactly factorizable.

In the following, we define approximate CF-based analogs of the fundamental network concepts. The theoretical advantage of these approximate CF-based concepts is that they satisfy simple relationships. Define the approximate CF-based adjacency matrix as follows

$$A_{CF,app} = CF\, CF^T. \tag{30}$$

Note that only the diagonal elements differ between $A_{CF,app}$ and $A_{CF}$. We define the approximate CF-based network concepts by using $A_{CF,app}$ as input of the network concept functions as detailed in the following

Definition 7 (Approximate CF-based Network Concepts) The approximate CF-based network concepts of a network A with conformity CF are defined by evaluating the network functions (equation (21)) on $A_{CF,app} = CF\, CF^T$, i.e.

$$NetworkConcept_{CF,app} := NetworkConcept(A_{CF,app}).$$
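Because the approximate CF-based adjacency matrix is the outer product of the conformity vector with itself, the approximate CF-based concepts reduce to simple functions of S_1(CF) and S_2(CF). The sketch below works under the reading of the network concept functions used here (connectivity as row sum, density as total sum over n(n-1), clustering coefficient as a ratio of quadratic forms). The function name and example vectors are hypothetical. Under these definitions the approximate clustering coefficient equals (n-1)/n times (1 + Heterogeneity^2)^2 × Density, so the relationship in Observation 3 holds up to a factor that tends to 1 for large n; this factor is derived here and is not quoted from the text.

```python
import math


def approx_cf_concepts(cf):
    """Approximate CF-based network concepts for a conformity vector cf,
    obtained by evaluating the concept functions on cf cf^T."""
    n = len(cf)
    s1 = sum(cf)
    s2 = sum(x * x for x in cf)
    connectivity = [x * s1 for x in cf]
    density = s1 ** 2 / (n * (n - 1))
    centralization = n / (n - 2) * (max(connectivity) / (n - 1) - density)
    # max() guards against a tiny negative value caused by rounding.
    heterogeneity = math.sqrt(max(0.0, n * s2 / s1 ** 2 - 1.0))
    cluster_coef = (s2 / s1) ** 2  # identical for every node
    return {"connectivity": connectivity, "density": density,
            "centralization": centralization,
            "heterogeneity": heterogeneity, "cluster_coef": cluster_coef}


cf = [0.9, 0.8, 0.6, 0.5, 0.4]  # hypothetical conformity vector
c = approx_cf_concepts(cf)
n = len(cf)
rhs = (n - 1) / n * (1 + c["heterogeneity"] ** 2) ** 2 * c["density"]
print(c["cluster_coef"], rhs)

# Constant network with b = 0.3 and n = 10: conformity sqrt(b) everywhere.
const = approx_cf_concepts([math.sqrt(0.3)] * 10)
print(const["density"], const["cluster_coef"])
```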

In approximately factorizable networks, fundamental 
network concepts are approximately equal to their 
approximate CF-based analogs 

Here we will provide a heuristic derivation of Observation 2. Since the components of CF are positive, one can easily show that $S_4(CF) < S_2(CF)^2$. For many large, exactly factorizable networks, the ratio $S_4(CF)/S_2(CF)^2$ is close to 0. Since

$$\frac{S_4(CF)}{S_2(CF)^2} = \frac{\|(A_{CF} - I) - A_{CF,app}\|_F^2}{\|A_{CF,app}\|_F^2},$$

this implies that $A_{CF} - I \approx A_{CF,app}$. Since the network concept functions are continuous functions, this implies $NetworkConcept(A_{CF} - I) \approx NetworkConcept(A_{CF,app})$. These derivations are summarized in the following

Observation 8 (Approximate Formulas for CF-based Concepts) If $S_4(CF)/S_2(CF)^2 \approx 0$, then

$$NetworkConcept(A_{CF} - I) \approx NetworkConcept(A_{CF,app}). \tag{31}$$

In particular, for exactly factorizable networks (i.e. $A - I = A_{CF} - I$), this means that the fundamental network concepts can be approximated by their approximate CF-based analogs.
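The identity behind this argument can be checked numerically: the two matrices differ only on the diagonal, where the difference is CF_i^2, so the squared Frobenius norm of the difference is S_4(CF) and that of the outer product is S_2(CF)^2. The sketch below uses hypothetical conformity vectors spread evenly between 0.2 and 0.8; the function names are assumptions.

```python
def frobenius_sq(M):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in M for x in row)


def s_p(cf, p):
    """S_p(cf) = sum of the p-th powers of the components."""
    return sum(x ** p for x in cf)


def approximation_ratios(cf):
    """Return S_4/S_2^2 and the equivalent ratio of Frobenius norms."""
    n = len(cf)
    a_app = [[cf[i] * cf[j] for j in range(n)] for i in range(n)]
    a_cf_minus_i = [[0.0 if i == j else cf[i] * cf[j] for j in range(n)]
                    for i in range(n)]
    diff = [[a_cf_minus_i[i][j] - a_app[i][j] for j in range(n)]
            for i in range(n)]
    return (s_p(cf, 4) / s_p(cf, 2) ** 2,
            frobenius_sq(diff) / frobenius_sq(a_app))


results = {}
for n in (5, 50, 500):
    cf = [0.2 + 0.6 * i / (n - 1) for i in range(n)]
    results[n] = approximation_ratios(cf)
    print(n, results[n])
```

For a constant conformity vector the ratio equals exactly 1/n, which makes the dependence on network size explicit.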

In our real data applications, we show empirically that 
equation (31) holds even in networks that satisfy the 
assumptions of Observation 8 only approximately. 

In the appendix (equation (43)), we define a measure of network factorizability as follows

$$F(A) = 1 - \frac{\|(A - I) - (A_{CF} - I)\|_F^2}{\|A - I\|_F^2}. \tag{32}$$

Thus, in approximately factorizable networks (i.e. F(A) close to 1), $A - I$ can be approximated by $A_{CF} - I$. For continuous network concept functions, this implies

$$NetworkConcept(A - I) \approx NetworkConcept(A_{CF} - I),$$

i.e. the fundamental network concepts are approximately equal to their CF-based analogs in approximately factorizable networks. Observation 8 states that

$$NetworkConcept(A_{CF} - I) \approx NetworkConcept(A_{CF,app}).$$

Combining the last two equations leads to $NetworkConcept(A - I) \approx NetworkConcept(A_{CF,app})$. These derivations are summarized as follows.

In approximately factorizable networks, the fundamental network concepts are approximately equal to their approximate CF-based analogs, i.e.

$$FundamentalNetworkConcept \approx NetworkConcept_{CF,app}.$$

Construction of gene co-expression networks 

Gene co-expression networks are constructed from micro- 
array data that measures the transcriptional response of 
cells to changing conditions. We consider the case of n 
genes with gene expression profiles across m microarray 
samples. Thus, the gene expression profiles are given by 
an n X m matrix 

$$X = [x_{ij}] = (x_1\ x_2\ \cdots\ x_n)^T, \quad i = 1, \ldots, n;\ j = 1, \ldots, m, \tag{33}$$

where the i-th row $x_i^T$ is the transcriptional response of
the i-th gene.

Recently, several groups have suggested thresholding the 
pairwise Pearson correlation coefficient $cor(x_i, x_j)$ in order
to arrive at gene co-expression networks, which are some- 
times referred to as 'relevance' networks [11,32]. In these 
networks, a node corresponds to the gene expression pro- 
file of a given gene. The corresponding adjacency matrix is 
determined from a measure of co-expression between the 
genes. In the examples below, we will use the absolute 
value of the Pearson correlation coefficient between the 
gene expression profiles to measure co-expression. 

To transform the co-expression measure into an adja- 
cency, one can make use of an adjacency function. The 
choice of the adjacency function determines whether the 
resulting network will be weighted (soft-thresholding) or 
unweighted (hard-thresholding). The adjacency function 
is a monotonically increasing function that maps the 
interval [0, 1] into [0, 1]. A widely used adjacency function is the signum function which implements 'hard' thresholding involving the threshold parameter $\tau$. Specifically,

$$a_{ij} = Signum(|cor(x_i, x_j)|, \tau) = Ind(|cor(x_i, x_j)| \ge \tau), \tag{34}$$

where the indicator function $Ind(\cdot)$ takes on the value 1 if the condition is satisfied and 0 otherwise. Hard thresholding using the signum function leads to intuitive network concepts (e.g., the node connectivity equals the number of direct neighbors), but it may lead to a loss of information: if $\tau$ has been set to 0.8, there will be no connection between two nodes if their similarity equals 0.79.

To avoid the disadvantages of hard thresholding, we proposed a 'soft' thresholding approach that raises the absolute value of the correlation to the power $\beta \ge 1$ [21], i.e.

$$a_{ij} = Power(|cor(x_i, x_j)|, \beta) = |cor(x_i, x_j)|^{\beta}. \tag{35}$$

In our yeast cell cycle gene co-expression network analysis, we followed the analysis steps described in [21]. Briefly, we used the 2001 most varying and connected genes. Next, we used the power adjacency function with $\beta = 7$ (equation (35)) to construct a weighted gene co-expression network and the signum adjacency function with $\tau = 0.65$ (equation (34)) to construct an unweighted network.
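The two adjacency functions can be sketched as follows. This is an illustrative Python version, not the authors' R tutorial code: the function names and the toy expression profiles are assumptions, and the hard threshold is applied with "greater than or equal to", which is one reading of equation (34).

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def signum_adjacency(cor, tau):
    """Hard thresholding: 1 if |cor| reaches the threshold tau, else 0."""
    return 1.0 if abs(cor) >= tau else 0.0


def power_adjacency(cor, beta):
    """Soft thresholding: |cor| raised to the power beta."""
    return abs(cor) ** beta


def adjacency_matrix(profiles, adjacency, param):
    """Symmetric adjacency matrix with unit diagonal, one row per gene."""
    n = len(profiles)
    return [[1.0 if i == j
             else adjacency(pearson(profiles[i], profiles[j]), param)
             for j in range(n)] for i in range(n)]


# Hypothetical expression profiles of three genes across four arrays.
profiles = [[1.0, 2.0, 3.0, 4.0],
            [1.0, 3.0, 2.0, 5.0],
            [4.0, 3.0, 2.0, 1.0]]
weighted = adjacency_matrix(profiles, power_adjacency, 7)       # beta = 7
unweighted = adjacency_matrix(profiles, signum_adjacency, 0.65)  # tau = 0.65
print(weighted[0][1], unweighted[0][1])
```

The information loss of hard thresholding is visible in a single pair: with a threshold of 0.8, correlations of 0.79 and 0.80 map to 0 and 1, whereas the power function maps them to nearby values.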

Using our R software tutorial, the reader can easily verify 
that our conclusions are highly robust with respect to a) 
different ways of constructing co-expression networks and 
b) different ways of constructing modules. 

Availability and requirements 

An R implementation and the data can be obtained from
the internet: http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/ModuleNetworks

Appendix: node conformity and factorizability of 
a general network 

Equation (20) provides an explicit formula for the con- 
formity of a weighted, exactly factorizable network. For a 
general, non-factorizable network, we describe here how 
to compute the conformity by optimizing an objective 
function. In the following, we assume a general n × n adjacency matrix A where n > 2. Let $v = (v_1, v_2, \ldots, v_n)^T$ be a vector of length n. We could define the conformity as a vector $v^*$ that minimizes the following objective function $f(v) = \sum_i \sum_{j \ne i} (a_{ij} - v_i v_j)^2$. But instead, we find the following equivalent formulation as a maximization problem more useful since it naturally gives rise to a measure of factorizability.

Specifically, we define the objective function

$$F_A(v) := 1 - \frac{\|(A - I) - (v v^T - diag(v^2))\|_F^2}{\|A - I\|_F^2} = 1 - \frac{\sum_i \sum_{j \ne i} (a_{ij} - v_i v_j)^2}{\sum_i \sum_{j \ne i} a_{ij}^2}. \tag{36}$$

It is clear that $F_A(CF) = 1$ for an exactly factorizable network with $a_{ij} = CF_i CF_j$ for $i \ne j$. Note that $F_A(v) \le 1$ and $F_A(0) = 0$. One can easily show that if $v^*$ maximizes $F_A(v)$, then $-v^*$ also maximizes $F_A(v)$. Further, all components of $v^*$ must have the same sign since otherwise, flipping the sign of the negative components leads to a higher value of $F_A(v)$. This leads us to the following

Definition 9 (Conformity, Factorizability) We define the 
conformity $CF$ as the vector with non-negative entries that 
maximizes $F_A(v)$. If there is more than one such maximizer, 
then a maximizer closest to $k / \sqrt{S_1(k)}$ is chosen. 
Further, we define the factorizability $F(A)$ as the 
corresponding maximum value $F_A(CF)$. 
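
As an illustration of equation (36), the objective can be evaluated 
directly from its double-sum form. The following is a minimal Python 
sketch, not the R implementation referenced above; the function name 
`objective` and the three-node toy network are assumptions made for 
this example only. 

```python
def objective(A, v):
    """Evaluate F_A(v) = 1 - sum_{i != j}(a_ij - v_i v_j)^2 / sum_{i != j} a_ij^2."""
    n = len(A)
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j]
    residual = sum((A[i][j] - v[i] * v[j]) ** 2 for i, j in off_diag)
    total = sum(A[i][j] ** 2 for i, j in off_diag)
    return 1.0 - residual / total


# Hypothetical exactly factorizable network: a_ij = cf_i * cf_j for i != j.
cf = [0.9, 0.5, 0.2]
A = [[1.0 if i == j else cf[i] * cf[j] for j in range(3)] for i in range(3)]

f_at_cf = objective(A, cf)                      # residual vanishes
f_at_zero = objective(A, [0.0, 0.0, 0.0])       # residual equals the total
f_at_minus_cf = objective(A, [-x for x in cf])  # sign flip leaves v_i * v_j unchanged
f_mixed_sign = objective(A, [0.9, -0.5, 0.2])   # mixed signs fit this network worse
```

In this toy network the four values reflect the properties stated 
above: the objective reaches 1 at the conformity, equals 0 at the 
zero vector, is unchanged under a global sign flip, and drops when 
only some components change sign. 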

Our definition of the conformity is a generalization of 
Definition 7 since $F(A) = 1$ if, and only if, $A$ is exactly 
factorizable with $a_{ij} = CF_i CF_j$ for $i \neq j$. The 
advantages of this definition are briefly described in the 
discussion section. 

In general, $F_A(v)$ may have multiple maximizers as can be 
demonstrated with the block diagonal simulated example 
(equation (17)) by choosing $n_1 = n_2$ and $b_1 = b_2$. By 
forming the first derivative of the factorizability function 
$F_A(v)$ in terms of $v_i$, one can show that a local maximum 
satisfies 

$$\sum_{j \neq i} a_{ij}\, CF_j = CF_i \sum_{j \neq i} CF_j^2, \tag{37}$$

i.e. 

$$(A - I + \mathrm{diag}(CF^2))\, CF = CF\, \|CF\|_2^2. \tag{38}$$
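
The step from the objective function to equations (37) and (38) can 
be written out as follows. This is a sketch of the calculation, 
assuming a symmetric adjacency matrix with unit diagonal 
($a_{ij} = a_{ji}$, $a_{ii} = 1$); $f(v)$ denotes the residual sum of 
squares introduced at the beginning of this appendix. 

```latex
\begin{align*}
f(v) &= \sum_{k} \sum_{l \neq k} (a_{kl} - v_k v_l)^2, \qquad
F_A(v) = 1 - \frac{f(v)}{\sum_{k} \sum_{l \neq k} a_{kl}^2}, \\
\frac{\partial f}{\partial v_i}
  &= -4 \sum_{j \neq i} (a_{ij} - v_i v_j)\, v_j = 0
  \;\Longleftrightarrow\;
  \sum_{j \neq i} a_{ij} v_j = v_i \sum_{j \neq i} v_j^2, \\
\sum_{j \neq i} a_{ij} v_j &= \big((A - I)\, v\big)_i, \qquad
v_i \sum_{j \neq i} v_j^2 = v_i \|v\|_2^2 - v_i^3 .
\end{align*}
```

Since the denominator of $F_A$ does not depend on $v$, maximizing 
$F_A$ is the same as minimizing $f$. Moving 
$v_i^3 = (\mathrm{diag}(v^2)\, v)_i$ to the left-hand side of the 
stationarity condition gives the matrix form 
$(A - I + \mathrm{diag}(v^2))\, v = v\, \|v\|_2^2$. 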

Proposition 10 (Expressions for the Factorizability) If 
the conformity vector $CF$ of the adjacency matrix $A$ exists, 
then the factorizability $F(A)$ is given by 

$$F(A) = \frac{\|A_{CF} - I\|_F^2}{\|A - I\|_F^2} = \frac{S_2(CF)^2 - S_4(CF)}{\|A - I\|_F^2}. \tag{39}$$

Proof Since 

$$F(A) = 1 - \frac{\|(A - I) + \mathrm{diag}(CF^2) - CF\, CF^{\top}\|_F^2}{\|A - I\|_F^2},$$

it will be sufficient to show that 

$$\|(A - I) - (A_{CF} - I)\|_F^2 = \|A - I\|_F^2 - \|A_{CF} - I\|_F^2.$$

From the definition of the Frobenius norm of a matrix $B$, one can 
show that $\|B\|_F^2 = \mathrm{trace}(B^{\top} B)$, where the trace 
of a matrix is the sum of its diagonal elements. Thus, 

$$\|(A - I) - (A_{CF} - I)\|_F^2 = \|A - I\|_F^2 + \|A_{CF} - I\|_F^2 - 2\, \mathrm{trace}\big((A - I)^{\top} (A_{CF} - I)\big).$$

Using equation (38), we find that 

$$\mathrm{trace}\big((A - I)(A_{CF} - I)\big) = \mathrm{tr}\big((A - I)\, CF\, CF^{\top}\big) - \mathrm{tr}\big((A - I)\, \mathrm{diag}(CF^2)\big) = CF^{\top} (A - I)\, CF = \|A_{CF} - I\|_F^2.$$

Thus, 
$\|(A - I) - (A_{CF} - I)\|_F^2 = \|A - I\|_F^2 - \|A_{CF} - I\|_F^2$. 
The remainder of the proof is straightforward. 
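
The remaining steps of the argument can be spelled out as below. This 
is a sketch that assumes the notation $S_p(v) = \sum_i v_i^p$ for the 
power sums $S_2$ and $S_4$, a unit diagonal for $A$, and 
$A_{CF} = CF\, CF^{\top} - \mathrm{diag}(CF^2) + I$. 

```latex
\begin{align*}
\|A_{CF} - I\|_F^2
  &= \sum_i \sum_{j \neq i} CF_i^2\, CF_j^2
   = \Big(\sum_i CF_i^2\Big)^2 - \sum_i CF_i^4
   = S_2(CF)^2 - S_4(CF), \\
CF^{\top} (A - I)\, CF
  &= CF^{\top} \big( CF\, \|CF\|_2^2 - \mathrm{diag}(CF^2)\, CF \big)
   = \|CF\|_2^4 - \sum_i CF_i^4
   = S_2(CF)^2 - S_4(CF), \\
F(A)
  &= 1 - \frac{\|A - I\|_F^2 - \|A_{CF} - I\|_F^2}{\|A - I\|_F^2}
   = \frac{\|A_{CF} - I\|_F^2}{\|A - I\|_F^2}.
\end{align*}
```

The second line uses the stationarity condition of equation (38) in 
the form $(A - I)\, CF = CF\, \|CF\|_2^2 - \mathrm{diag}(CF^2)\, CF$; 
the term $\mathrm{tr}((A - I)\, \mathrm{diag}(CF^2))$ vanishes because 
$A - I$ has a zero diagonal. 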

Equation (38) suggests that the conformity is an eigenvec- 
tor of the 'hat' adjacency matrix 



$$\hat{A} := A - I + \mathrm{diag}(CF^2).$$

An algorithm for computing the conformity is based on 
the following 

Lemma 11 If $\hat{A}$ denotes a symmetric real matrix with 
eigenvalues $d_1, \ldots, d_n$ sorted according to their absolute 
values, i.e., $|d_1| \geq |d_2| \geq \ldots \geq |d_n|$, and the 
corresponding orthonormal eigenvectors are denoted by 
$u_1, \ldots, u_n$, then $\|\hat{A} - v v^{\top}\|_F^2$ is 
minimized at $v^* = \sqrt{|d_1|}\; u_1$. 

The proof can be found in Horn and Johnson (1991). 

Denote by $CF(i - 1)$ an estimate of the conformity $CF$. 
Next define 

$$\hat{A}(i - 1) = A - I + \mathrm{diag}(CF(i - 1)^2). \tag{40}$$

Define a new estimate of the conformity by 

$$CF(i) = \sqrt{d_1(i - 1)}\; u_1(i - 1), \tag{41}$$

where $d_1(i - 1)$ and $u_1(i - 1)$ denote the largest eigenvalue 
and corresponding unit length eigenvector of $\hat{A}(i - 1)$. 
One can easily show that all the components of $u_1(i - 1)$ must 
have the same sign and we assume without loss of generality 
non-negative components. Lemma 11 with $\hat{A} = \hat{A}(i - 1)$ 
implies that 

$$\|A - I + \mathrm{diag}(CF(i - 1)^2) - CF(i - 1)\, CF(i - 1)^{\top}\|_F^2 \geq \|A - I + \mathrm{diag}(CF(i - 1)^2) - CF(i)\, CF(i)^{\top}\|_F^2.$$

Considering the diagonal elements, one can easily show 
that 

$$\|A - I + \mathrm{diag}(CF(i - 1)^2) - CF(i)\, CF(i)^{\top}\|_F^2 \geq \|A - I + \mathrm{diag}(CF(i)^2) - CF(i)\, CF(i)^{\top}\|_F^2.$$

Thus, we arrive at the following 

$$F_A(CF(i)) \geq F_A(CF(i - 1)), \tag{42}$$

which suggests a monotonic algorithm for computing CF. 
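
The diagonal argument behind the monotonicity can be made explicit. 
The display below is a sketch assuming a unit diagonal for $A$; the 
notation $CF_j(i)$ for the $j$-th component of the $i$-th iterate is 
introduced here only for this calculation. 

```latex
\begin{align*}
\big\| A - I + \mathrm{diag}(CF(i-1)^2) - CF(i)\, CF(i)^{\top} \big\|_F^2
 &= \sum_{j} \sum_{l \neq j} \big(a_{jl} - CF_j(i)\, CF_l(i)\big)^2
  + \sum_{j} \big(CF_j(i-1)^2 - CF_j(i)^2\big)^2 \\
 &\geq \sum_{j} \sum_{l \neq j} \big(a_{jl} - CF_j(i)\, CF_l(i)\big)^2
  = \big\| A - I + \mathrm{diag}(CF(i)^2) - CF(i)\, CF(i)^{\top} \big\|_F^2 .
\end{align*}
```

The off-diagonal sum is the numerator of the fraction in equation 
(36) evaluated at $CF(i)$. Replacing the diagonal by 
$\mathrm{diag}(CF(i)^2)$ removes the non-negative diagonal term, so 
the residual cannot increase from one iteration to the next. 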

Equation (16) suggests choosing $k / \sqrt{S_1(k)}$ as a starting 
value of the algorithm. These comments give rise to the 
following 

Definition 12 (Algorithmic Definition of Conformity, 
Factorizability) For a general network $A$, set 
$CF(1) = k / \sqrt{S_1(k)}$ and apply the monotonic iterative 
algorithm described by equations (40) and (41). If the limit 
$CF(\infty)$ exists, we define it as the conformity 
$CF = CF(\infty)$. Further, we define the network factorizability as 

$$F(A) = 1 - \frac{\|(A - I) - (A_{CF} - I)\|_F^2}{\|A - I\|_F^2}. \tag{43}$$



Note that the conformity satisfies equation (38) by definition 
of convergence. One can easily show that $0 \leq F(A) \leq 1$. 
Further, one can easily show that $F(A) = 1$ if, and only 
if, $A$ is exactly factorizable with $a_{ij} = CF_i CF_j$, i.e. 
$A - I = A_{CF} - I$. 

The algorithm described by equations (40) and (41) is 
monotonic (equation (42)). It is a special case of an algo- 
rithm described in [46] for fitting a least squares factor 
analysis model with one factor. Theoretical properties of 
the algorithm are discussed in [46] and [48]. 
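
The iteration of equations (40) and (41) can be sketched in a few 
lines. The Python code below is an illustration under stated 
assumptions, not the R implementation referenced in this article: the 
function names, tolerances and toy network are hypothetical, the 
diagonal of $A$ is assumed to be 1, and the largest eigenpair is 
obtained by power iteration as a stand-in for a full 
eigendecomposition. 

```python
import math


def top_eigenpair(M, max_steps=1000, tol=1e-13):
    """Power iteration: largest eigenvalue and unit-length eigenvector of a
    symmetric matrix with non-negative entries."""
    n = len(M)
    u = [1.0 / math.sqrt(n)] * n
    for _ in range(max_steps):
        w = [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        change = max(abs(a - b) for a, b in zip(w, u))
        u = w
        if change < tol:
            break
    d = sum(u[i] * M[i][j] * u[j] for i in range(n) for j in range(n))
    return d, u


def conformity(A, max_iter=1000, tol=1e-10):
    """Iterate equations (40) and (41), starting from k / sqrt(S_1(k))."""
    n = len(A)
    k = [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]
    cf = [k_i / math.sqrt(sum(k)) for k_i in k]
    for _ in range(max_iter):
        # Equation (40): off-diagonal entries of A, diagonal CF(i-1)^2.
        A_hat = [[cf[i] ** 2 if i == j else A[i][j] for j in range(n)]
                 for i in range(n)]
        d, u = top_eigenpair(A_hat)
        new_cf = [math.sqrt(d) * x for x in u]  # equation (41)
        change = max(abs(a - b) for a, b in zip(new_cf, cf))
        cf = new_cf
        if change < tol:
            break
    return cf


def factorizability(A, cf):
    """Equation (43), written as sums over the off-diagonal entries."""
    n = len(A)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    residual = sum((A[i][j] - cf[i] * cf[j]) ** 2 for i, j in pairs)
    return 1.0 - residual / sum(A[i][j] ** 2 for i, j in pairs)


# Hypothetical exactly factorizable network with known conformities.
true_cf = [0.8, 0.7, 0.6, 0.5, 0.4]
A = [[1.0 if i == j else true_cf[i] * true_cf[j] for j in range(5)]
     for i in range(5)]
cf = conformity(A)
F = factorizability(A, cf)
```

For this exactly factorizable toy network, the estimate `cf` should 
recover `true_cf` up to numerical tolerance and `F` should be close 
to 1. Power iteration was chosen only to keep the sketch free of 
external libraries; any symmetric eigensolver could take its place. 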

We find that for most real networks, the conformity is 
highly related to the first eigenvector of the adjacency 
matrix, i.e. the conformity vector $CF$ is roughly equal to 
$\sqrt{d_1}\; u_1$, where $d_1$ is the largest eigenvalue of $A$ 
and $u_1$ is the corresponding unit length eigenvector with 
positive components. 
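
The eigenvector approximation can be tried on a simulated network. 
The Python sketch below is an illustration only: it uses a 
hypothetical, exactly factorizable network of 50 nodes (so the true 
conformity is known by construction) rather than real data, and it 
computes the largest eigenpair of $A$ by power iteration. 

```python
import math

n = 50
# Hypothetical conformities spread evenly between 0.3 and 0.9.
cf = [0.3 + 0.6 * i / (n - 1) for i in range(n)]
# Adjacency matrix with unit diagonal and a_ij = cf_i * cf_j off the diagonal.
A = [[1.0 if i == j else cf[i] * cf[j] for j in range(n)] for i in range(n)]

# Power iteration for the largest eigenvalue d1 and unit eigenvector u1 of A.
u = [1.0 / math.sqrt(n)] * n
for _ in range(200):
    w = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in w))
    u = [x / norm for x in w]
d1 = sum(u[i] * A[i][j] * u[j] for i in range(n) for j in range(n))

# Approximate conformity sqrt(d1) * u1 and its largest relative deviation.
approx_cf = [math.sqrt(d1) * x for x in u]
max_rel_error = max(abs(a - c) / c for a, c in zip(approx_cf, cf))
```

The approximation is not exact because $A$ carries a unit diagonal 
where the 'hat' adjacency matrix carries $CF_i^2$. In a network of 
this size the diagonal has little weight, so `max_rel_error` is 
expected to be small; in very small networks the deviation can be 
considerably larger. 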